[ https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338897#comment-14338897 ]
mike bowles commented on SPARK-1673:
------------------------------------

Some colleagues and I have a Spark version of glmnet working and have started some discussion. Joseph Bradley suggested that we copy the discussion here in order to keep track of it. Here's the discussion thread in the usual last-first order. Besides myself, the thread involves Joseph and Debasish Das, who is working on the OWLQN implementation.

On Wed, Feb 25, 2015 at 10:35 AM, <m...@mbowles.com> wrote:

Hi Debasish,

Any method that generates point solutions to the minimization problem could simply be run a number of times to generate the coefficient paths as a function of the penalty parameter. I think the only issues are how easy the method is to use and how much training and developer time is required to produce an answer.

With regard to training time, Friedman says in his paper that they found problems where glmnet would generate the entire coefficient path more rapidly than sophisticated single-point methods would generate single-point solutions - not all problems, but some problems. Ryan Tibshirani (Robert's son), who's a professor and researcher at CMU in convex function optimization, has echoed that assertion for the particular case of the elastic net penalty function (that's from slides of his that are available online). So there's an open question about the training speed that I believe we can answer in fairly short order. I'm eager to explore that. Does OWLQN do a pass through the data for each iteration? The linear version of GLMNET does not. On the other hand, OWLQN may be able to take coarser steps through parameter space.

With regard to developer time, glmnet doesn't require the user to supply a starting point for the penalty parameter. It calculates the starting point. That makes it completely automatic. You've probably been through the process of manually searching regularization parameter space with SVM.
Pick out a set of regularization parameter values, like 10 raised to the powers -2 through +5 in steps of 1. See if there's a minimum in the range, and if not, shift to the right or left. One of the reasons I pick up glmnet first for a new problem is that you just drop in the training set and out pop the coefficient curves. Usually the defaults work. One time out of 50 (or so) it doesn't converge. It alerts you that it didn't converge, and you change one parameter and rerun. If you also drop in a test set, then it even picks the optimum solution and produces an estimate of out-of-sample error.

We're going to make some speed/scaling runs on the synthetic data sets (in a range of sizes) that are used in Spark for testing linear regression. We need some wider data sets. Joseph mentioned some that we'll look at. I've got a gene expression data set that's 30k wide by 15k tall. That takes a few hours to train using the R version of glmnet. We're also talking to some biology friends to find other interesting data sets.

I really am eager to see the comparisons. And happy to help you tailor OWLQN to generate coefficient paths. We might be able to produce a hybrid of Friedman's algorithm, using his basic algorithm outline but substituting OWLQN for his round-robin coordinate descent. But I'm a little concerned that it's the round-robin coordinate descent that makes it possible to skip passing through the full data set for 4 out of 5 iterations. We might be able to work a way around that.

I'm just eager to have parallel versions of the tools available. I'll keep you posted on our results. We should aim for running one another's code. I'll check with my colleagues and see when we'll have something we can hand out. We've delayed putting together a release version in favor of generating some scaling results, as Joseph suggested. Discussions like this may have some impact on what the release code looks like.
Mike

From: Debasish Das [mailto:debasish.da...@gmail.com]
Sent: Wednesday, February 25, 2015 08:50 AM
To: 'Joseph Bradley'
Cc: m...@mbowles.com, 'dev'
Subject: Re: Have Friedman's glmnet algo running in Spark

Any reason why the regularization path cannot be implemented using the current OWLQN PR? We can change OWLQN in Breeze to fit your needs...

*From:* Joseph Bradley [mailto:jos...@databricks.com]
*Sent:* Sunday, February 22, 2015 06:48 PM
*To:* m...@mbowles.com
*Cc:* d...@spark.apache.org
*Subject:* Re: Have Friedman's glmnet algo running in Spark

Hi Mike,

glmnet has definitely been very successful, and it would be great to see how we can improve optimization in MLlib! There is some related work ongoing; here are the JIRAs:

GLMNET implementation in Spark
LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package

The GLMNET JIRA has actually been closed in favor of the latter JIRA. However, if you're getting good results in your experiments, could you please post them on the GLMNET JIRA and link them from the other JIRA? If it's faster and more scalable, that would be great to find out.

As far as where the code should go and the APIs, that can be discussed on the JIRA.

I hope this helps, and I'll keep an eye out for updates on the JIRAs!

Joseph

> GLMNET implementation in Spark
> ------------------------------
>
> Key: SPARK-1673
> URL: https://issues.apache.org/jira/browse/SPARK-1673
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie,
> Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2
> regularized linear models, including Linear/Logistic/Multinomial regressions.
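
For readers following the thread, here is a rough single-machine sketch of the ideas discussed above: an automatically computed starting penalty, a decreasing penalty sequence with warm starts, and round-robin coordinate descent for the elastic net. It is not the Spark code from this issue and not the glmnet source. The function names (`soft_threshold`, `lambda_max`, `coordinate_descent`, `elastic_net_path`), the defaults, and the assumption of centered data with no intercept are all illustrative choices. The objective assumed is (1/2n)||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||^2).

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; exactly zero inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lambda_max(X, y, alpha):
    """Smallest penalty at which every coefficient is zero (requires alpha > 0)."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n))) / n
               for j in range(p)) / alpha


def coordinate_descent(X, y, lam, alpha, beta_start, max_sweeps=1000, tol=1e-10):
    """Round-robin coordinate descent for one penalty value, warm-started."""
    n, p = len(X), len(X[0])
    beta = list(beta_start)
    # Residuals for the starting coefficients; updated incrementally below.
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            old = beta[j]
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * old
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new - old
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def elastic_net_path(X, y, alpha=1.0, n_lambdas=20, eps=1e-3):
    """Coefficients along a geometric penalty grid from lambda_max down to eps * lambda_max."""
    if n_lambdas < 2:
        raise ValueError("n_lambdas must be at least 2")
    lam_hi = lambda_max(X, y, alpha)
    ratio = eps ** (1.0 / (n_lambdas - 1))
    beta = [0.0] * len(X[0])
    path = []
    for k in range(n_lambdas):
        lam = lam_hi * ratio ** k
        beta = coordinate_descent(X, y, lam, alpha, beta)  # warm start
        path.append((lam, list(beta)))
    return path
```

Calling `elastic_net_path(X, y)` on a list-of-rows `X` and a list `y` returns `(penalty, coefficients)` pairs, so the caller never supplies a starting penalty. A point solver such as OWLQN could in principle stand in for `coordinate_descent` inside the same loop, which is the hybrid raised in the thread; whether that keeps the speed advantage is the open question discussed above.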
--
This message was sent by Atlassian JIRA (v6.3.4#6332)