[ https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359366#comment-14359366 ]
mike bowles commented on SPARK-1673:
------------------------------------

Here's a table of scaling results for our implementation of glmnet regression. These were run locally on a 4-core server. The data set is the Higgs boson data set (available on AWS). We measured training times for various numbers of rows of data from 1000 to 10 million. The attribute space is 28 variables wide. We ran on 1 through 4 cores on the server.

Training times (sec)

#rows   1-core   2-core   3-core   4-core
100K    4.88     3.79     3.41     3.48
1M      20.5     10.6     9.51     8.45
5M      71.2     37.1     26.7     25.5
10M     155      70.5     59.7     49.7

The structure of the algorithm suggests that the training times should be linear in the number of rows, and the test results bear that out. Two cores show a speedup of ~2 over one core, three cores show ~2.6, and four cores show ~3.11. The four-core result probably lags because of contention with system functions, etc. Running on AWS will make that clearer. That's in process now.

Our next steps are
1. run on some wider data sets
2. run on a larger cluster
3. run OWLQN on the same problems in the same setting
4. experiment with speedups: Joseph Bradley's approximation idea, and cutting down the number of data passes by predicting which variables are going to become active instead of waiting until they do.

> GLMNET implementation in Spark
> ------------------------------
>
>                 Key: SPARK-1673
>                 URL: https://issues.apache.org/jira/browse/SPARK-1673
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2 regularized linear models, including Linear/Logistic/Multinomial regressions.
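
The speedup figures quoted in the comment can be rechecked directly from the training-time table. The following is a minimal Python sketch, not part of the original benchmark code: the names `TRAINING_TIMES`, `speedups`, and `efficiencies` are hypothetical, and the only data used are the times reported in the comment.

```python
# Training times in seconds, copied from the table in the comment.
# Keys are row counts; each list holds the times for 1, 2, 3 and 4 cores.
TRAINING_TIMES = {
    "100K": [4.88, 3.79, 3.41, 3.48],
    "1M": [20.5, 10.6, 9.51, 8.45],
    "5M": [71.2, 37.1, 26.7, 25.5],
    "10M": [155.0, 70.5, 59.7, 49.7],
}


def speedups(times):
    """Return the speedup over the 1-core time for each core count."""
    base = times[0]
    return [base / t for t in times]


def efficiencies(times):
    """Return parallel efficiency: speedup divided by the number of cores."""
    return [s / (i + 1) for i, s in enumerate(speedups(times))]


if __name__ == "__main__":
    for rows, times in TRAINING_TIMES.items():
        s = ", ".join(f"{x:.2f}" for x in speedups(times))
        e = ", ".join(f"{x:.2f}" for x in efficiencies(times))
        print(f"{rows:>5}  speedup: {s}  efficiency: {e}")
```

Run on the table above, the 10M-row entry gives speedups close to the ~2.6 and ~3.11 quoted for three and four cores, which suggests those figures were taken from the largest run; the smaller runs scale less well.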