[jira] [Commented] (SPARK-1673) GLMNET implementation in Spark

mike bowles (JIRA) Thu, 12 Mar 2015 13:46:21 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359366#comment-14359366
 ]


mike bowles commented on SPARK-1673:
------------------------------------

Here's a table of scaling results for our implementation of glmnet regression.  
These were run locally on a 4-core server.  The data set is the Higgs boson data 
set (available on AWS).  We measured training times for various numbers of rows 
of data from 1000 to 10 million.  The attribute space is 28 variables wide.  We 
ran on 1 through 4 cores on the server.  

Training times (sec)

#rows   1-core  2-core  3-core  4-core
100K    4.88    3.79    3.41    3.48
1M      20.5    10.6    9.51    8.45
5M      71.2    37.1    26.7    25.5
10M     155     70.5    59.7    49.7

The structure of the algorithm suggests that the training times should be 
linear in the number of rows, and the test results bear that out.  Taking the 
10M-row line, two cores show a speedup of ~2.2 over one core, three cores show 
~2.6, and four cores show ~3.1.  The four-core result probably lags because of 
contention with system processes etc.  Running on AWS will make that clearer.  
That's in progress now.
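
For anyone who wants to recompute the speedup figures, here is a minimal Python sketch. It uses only the training times from the table in this comment; the function names are illustrative and are not part of the Spark implementation.

```python
# Training times in seconds, copied from the table in this comment:
# number of rows -> times on 1, 2, 3 and 4 cores.
times = {
    100_000: [4.88, 3.79, 3.41, 3.48],
    1_000_000: [20.5, 10.6, 9.51, 8.45],
    5_000_000: [71.2, 37.1, 26.7, 25.5],
    10_000_000: [155.0, 70.5, 59.7, 49.7],
}


def speedups(row_times):
    """Speedup of each core count relative to the single-core time."""
    base = row_times[0]
    return [base / t for t in row_times]


def efficiencies(row_times):
    """Parallel efficiency: speedup divided by the number of cores used."""
    return [s / (i + 1) for i, s in enumerate(speedups(row_times))]


if __name__ == "__main__":
    for rows, row_times in times.items():
        s = ", ".join(f"{x:.2f}" for x in speedups(row_times))
        e = ", ".join(f"{x:.2f}" for x in efficiencies(row_times))
        print(f"{rows:>12,d} rows  speedup: {s}  efficiency: {e}")
```

Efficiency here is just speedup divided by core count, so a value below 1.0 on four cores shows how far that run falls short of ideal scaling.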

Our next steps are:
1.  run on some wider data sets
2.  run on a larger cluster
3.  run OWLQN on the same problems in the same setting
4.  experiment with speedups: Joseph Bradley's approximation idea, and cutting 
the number of data passes by predicting which variables are going to become 
active instead of waiting until they do.  

> GLMNET implementation in Spark
> ------------------------------
>
>                 Key: SPARK-1673
>                 URL: https://issues.apache.org/jira/browse/SPARK-1673
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, 
> Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2 
> regularized linear models, including Linear/Logistic/Multinomial regressions.
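
As background on the coordinate-descent approach named in the issue description, here is a minimal single-machine sketch in plain Python. It is not the Spark implementation discussed in this thread. It assumes the elastic-net objective from the glmnet paper, (1/2n)*||y - X*b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2), with no intercept and no column standardization, and the names elastic_net_cd and soft_threshold are illustrative.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha, n_sweeps=100, tol=1e-8):
    """Cyclic coordinate descent for an elastic-net linear model.

    X is a list of rows, y a list of targets. Assumes no intercept and
    no standardization (both are simplifications for this sketch).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X*beta, since beta starts at zero
    # Mean squared value of each column, reused in every update.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]

    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            denom = col_sq[j] + lam * (1.0 - alpha)
            if denom == 0.0:
                continue  # all-zero column with a pure L1 penalty: leave at 0
            old = beta[j]
            # Column j against the partial residual (coordinate j added back in).
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * old
            new = soft_threshold(rho, lam * alpha) / denom
            delta = new - old
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break  # a full sweep changed nothing beyond the tolerance
    return beta
```

Each coordinate update is one pass over the rows, which is consistent with the comment's expectation that training time is linear in the number of rows.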



