Joseph,

Thanks for your reply. We'll take the steps you suggest: generate some timing comparisons and post them on the GLMNET JIRA with a link from the OWLQN JIRA.

We've got the regression version of GLMNET programmed. The regression version only requires a pass through the data each time the active set of coefficients changes. That's usually less than or equal to the number of decrements in the penalty coefficient (typical default = 100). The intermediate iterations can be done using results of previous passes through the full data set. We're expecting that the number of data passes will be independent of both the number of rows and the number of columns in the data set. We're eager to demonstrate this scaling.

Do you have any suggestions regarding data sets for large scale regression problems? It would be nice to demonstrate scaling for both number of rows and number of columns.

Thanks for your help.
Mike

-----Original Message-----
From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Sunday, February 22, 2015 06:48 PM
To: m...@mbowles.com
Cc: dev@spark.apache.org
Subject: Re: Have Friedman's glmnet algo running in Spark

Hi Mike,

glmnet has definitely been very successful, and it would be great to see how we can improve optimization in MLlib! There is some related work ongoing; here are the JIRAs:
- GLMNET implementation in Spark
- LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package

The GLMNET JIRA has actually been closed in favor of the latter JIRA. However, if you're getting good results in your experiments, could you please post them on the GLMNET JIRA and link them from the other JIRA? If it's faster and more scalable, that would be great to find out.

As far as where the code should go and the APIs, that can be discussed on the JIRA.

I hope this helps, and I'll keep an eye out for updates on the JIRAs!

Joseph

On Thu, Feb 19, 2015 at 10:59 AM, wrote:
> Dev List,
> A couple of colleagues and I have gotten several versions of the glmnet algo
> coded and running on Spark RDD. The glmnet algo (
> http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> generating coefficient paths solving penalized regression with elastic net
> penalties.
> The algorithm runs fast by taking an approach that generates
> solutions for a wide variety of penalty parameters. We're able to integrate
> into the MLlib class structure a couple of different ways. The algorithm may
> fit better into the new pipeline structure since it naturally returns a
> multitude of models (corresponding to different values of penalty
> parameters). That appears to fit better into pipeline than MLlib linear
> regression (for example).
>
> We've got regression running with the speed optimizations that Friedman
> recommends. We'll start working on the logistic regression version next.
>
> We're eager to make the code available as open source and would like to
> get some feedback about how best to do that. Any thoughts?
> Mike Bowles.
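
For readers following the thread, here is a minimal sketch of the "covariance update" idea described in the reply at the top: inner products with the response are computed once, inner products between features are computed only when a coefficient first enters the active set, and all other coordinate-descent iterations reuse those cached values. This is not the code discussed in the thread. It is plain Python on in-memory rows, the function names (`soft_threshold`, `elastic_net_path`) and the `passes` counter are hypothetical, and it assumes centered data, no intercept, and no all-zero columns. In Spark, each counted "pass" would correspond to one distributed aggregate over the RDD.

```python
def soft_threshold(z, g):
    """Soft-thresholding operator S(z, g) used by coordinate descent."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def elastic_net_path(X, y, lambdas, alpha=1.0, tol=1e-8, max_iter=1000):
    """Coordinate descent along a decreasing penalty path with cached
    inner products. Minimizes, for each lam in lambdas,
        (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||^2).
    Returns (list of coefficient vectors, number of counted data passes).
    """
    n, p = len(X), len(X[0])
    # Counted as one data pass: <x_j, y>/n and <x_j, x_j>/n for every column.
    xy = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(p)]
    xx = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    passes = 1
    gram = {}          # k -> [<x_k, x_m>/n for all m], filled when k turns active
    beta = [0.0] * p   # warm start carried from one penalty value to the next
    path = []
    for lam in lambdas:
        for _ in range(max_iter):
            max_delta = 0.0
            for j in range(p):
                # Partial-residual correlation from cached products only.
                rho = xy[j] - sum(gram[k][j] * beta[k]
                                  for k in gram if k != j and beta[k] != 0.0)
                new = soft_threshold(rho, lam * alpha) / (xx[j] + lam * (1.0 - alpha))
                if new != 0.0 and j not in gram:
                    # Active set grew: one more pass to cache column j's products.
                    gram[j] = [sum(X[i][j] * X[i][m] for i in range(n)) / n
                               for m in range(p)]
                    passes += 1
                max_delta = max(max_delta, abs(new - beta[j]))
                beta[j] = new
            if max_delta < tol:
                break
        path.append(list(beta))
    return path, passes
```

In this sketch the pass count is bounded by one plus the number of features that ever become active, regardless of how many coordinate-descent sweeps each penalty value needs. Calling `elastic_net_path(X, y, [1.0, 0.5, 0.01])` returns one coefficient vector per penalty value, which is the "multitude of models" mentioned in the quoted message.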