[ https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338897#comment-14338897 ]

mike bowles commented on SPARK-1673:
------------------------------------

Some colleagues and I have a Spark version of glmnet working and have started 
some discussion.  Joseph Bradley suggested that we copy the discussion here in 
order to keep track of it.  Here's the discussion thread in the usual 
last-first order.  Besides myself, the thread involves Joseph and Debasish Das, 
who is working on the OWLQN implementation.  

On Wed, Feb 25, 2015 at 10:35 AM, <m...@mbowles.com> wrote:

Hi Debasish,
Any method that generates point solutions to the minimization problem could 
simply be run a number of times to generate the coefficient paths as a function 
of the penalty parameter.  I think the only issues are how easy the method is 
to use and how much training and developer time is required to produce an 
answer. 
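
To make that concrete, here's a minimal sketch (illustrative only, not our 
actual code) of how any single-point solver can be wrapped to produce the full 
path: run it over a descending, log-spaced grid of penalty values, 
warm-starting each fit from the previous solution.  The names (PathSketch, 
coefficientPath, solve) are placeholders.

{code:scala}
// Illustrative sketch only: wrap any single-point L1 solver to trace the
// full coefficient path over a descending, log-spaced grid of penalty
// values, warm-starting each fit from the previous solution.
object PathSketch {
  type Weights = Array[Double]

  def coefficientPath(
      solve: (Double, Weights) => Weights, // (lambda, warmStart) => weights
      lambdaMax: Double,                   // penalty at which all weights are 0
      numLambdas: Int = 100,
      lambdaMinRatio: Double = 1e-3,
      numFeatures: Int): Seq[(Double, Weights)] = {
    // Log-spaced grid from lambdaMax down to lambdaMax * lambdaMinRatio.
    val lambdas = (0 until numLambdas).map { k =>
      lambdaMax * math.pow(lambdaMinRatio, k.toDouble / math.max(numLambdas - 1, 1))
    }
    var warm: Weights = Array.fill(numFeatures)(0.0) // all zeros at lambdaMax
    lambdas.map { lambda =>
      warm = solve(lambda, warm) // warm start makes neighboring fits cheap
      (lambda, warm)
    }
  }
}
{code}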

With regard to training time, Friedman says in his paper that they found 
problems where glmnet would generate the entire coefficient path more rapidly 
than sophisticated single point methods would generate single point solutions - 
not all problems, but some problems.  Ryan Tibshirani (Robert's son), who's a 
professor and researcher in convex optimization at CMU, has echoed that 
assertion for the particular case of the elastic net penalty function (that's 
from slides of his that are available online).  So there's an open question 
about the training speed that I believe we can answer in fairly short order.  
I'm eager to explore that.  Does OWLQN do a pass through the data for each 
iteration?  The linear version of GLMNET does not.  On the other hand, OWLQN 
may be able to take coarser steps through parameter space.  
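
For reference, the inner loop glmnet cycles is a soft-threshold coordinate 
update.  Here's a rough sketch of the naive form from Friedman's paper 
(squared-error loss, standardized features); the covariance updates and 
active-set heuristics that let most cycles avoid a full pass over the data are 
omitted.

{code:scala}
// Rough sketch of the naive coordinate-descent cycle at the core of glmnet
// (squared-error loss, standardized features, elastic-net penalty). One
// cycle soft-thresholds each coefficient in turn; Friedman's covariance
// updates and active-set tricks (omitted here) are what let most cycles
// avoid touching the full data set.
object CoordinateDescentSketch {
  def softThreshold(z: Double, gamma: Double): Double =
    math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

  def cdCycle(
      x: Array[Array[Double]],  // n x p, columns standardized
      residual: Array[Double],  // current residual r = y - X * beta (updated in place)
      beta: Array[Double],      // current coefficients (updated in place)
      lambda: Double,
      alpha: Double): Unit = {
    val n = residual.length
    for (j <- beta.indices) {
      // z = beta_j + (1/n) <x_j, r>, the unpenalized univariate solution
      var z = beta(j)
      var i = 0
      while (i < n) { z += x(i)(j) * residual(i) / n; i += 1 }
      val newBj = softThreshold(z, lambda * alpha) / (1.0 + lambda * (1.0 - alpha))
      val delta = newBj - beta(j)
      if (delta != 0.0) { // keep the residual consistent with the new beta_j
        var k = 0
        while (k < n) { residual(k) -= delta * x(k)(j); k += 1 }
        beta(j) = newBj
      }
    }
  }
}
{code}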

With regard to developer time, glmnet doesn't require the user to supply a 
starting point for the penalty parameter.  It calculates the starting point.  
That makes it completely automatic.  You've probably been through the process 
of manually searching regularization parameter space with SVM.  Pick a set of 
regularization parameter values, say 10 raised to powers from -2 through +5 in 
steps of 1.  See if there's a minimum in the range and if not shift to the 
right or left.  One of the reasons I pick up glmnet first for a new problem is 
that you just drop in the training set and out pop the coefficient curves.  
Usually the defaults work.  One time out of 50 (or so) it doesn't converge.  It 
alerts you that it didn't converge and you change one parameter and rerun.  If 
you also drop in a test set then it even picks the optimum solution and 
produces an estimate of out-of-sample error.  
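
That automatic starting point falls out of the optimality conditions at zero: 
lambdaMax is the smallest penalty at which every coefficient is exactly zero.  
A sketch of the calculation (assuming a centered response):

{code:scala}
// Sketch of glmnet's automatic starting penalty: from the KKT conditions at
// beta = 0, every coefficient stays zero as long as
//   |<x_j, y>| / n <= lambda * alpha   for all j,
// so lambdaMax = max_j |<x_j, y>| / (n * alpha) for centered y.
object LambdaStart {
  def lambdaMax(
      x: Array[Array[Double]], // n x p design matrix
      y: Array[Double],        // centered response
      alpha: Double): Double = {
    val n = y.length
    val p = x(0).length
    val maxAbsCorr = (0 until p).map { j =>
      math.abs((0 until n).map(i => x(i)(j) * y(i)).sum) / n
    }.max
    maxAbsCorr / alpha
  }
}
{code}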

We're going to make some speed/scaling runs on the synthetic data sets (in a 
range of sizes) that are used in Spark for testing linear regression.  We need 
some wider data sets.  Joseph mentioned some that we'll look at.  I've got a 
gene expression data set that's 30k wide by 15k tall.  That takes a few hours 
to train using the R version of glmnet.  We're also talking to some biology 
friends 
to find other interesting data sets. 

I really am eager to see the comparisons.  And happy to help you tailor OWLQN 
to generate coefficient paths.  We might be able to produce a hybrid that 
keeps Friedman's basic algorithm outline but substitutes OWLQN for his 
round-robin coordinate descent.  But I'm a little concerned that it's the 
round-robin coordinate descent that makes it possible to skip passing through 
the full data set for 4 out of 5 iterations.  We might be able to work 
a way around that. 
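
To show what that hybrid's skeleton might look like, here's a toy end-to-end 
run reusing the sketches above: automatic lambdaMax, log-spaced grid, warm 
starts.  The point solver is a few proximal-gradient (ISTA) steps, purely a 
stand-in so the example runs; an OWLQN minimizer would be dropped in at the 
same spot.

{code:scala}
// Toy end-to-end run of the hybrid idea, reusing the sketches above. The
// point solver is a few proximal-gradient (ISTA) steps, purely a stand-in
// so the sketch runs; an OWLQN minimizer would be dropped in at this spot.
object HybridSketch {
  import PathSketch.coefficientPath
  import CoordinateDescentSketch.softThreshold
  import LambdaStart.lambdaMax

  def main(args: Array[String]): Unit = {
    // Tiny synthetic problem: y depends on feature 0 only.
    val x = Array(Array(1.0, 0.1), Array(-1.0, 0.2), Array(0.5, -0.3), Array(-0.5, 0.0))
    val y = x.map(row => 2.0 * row(0))
    val n = y.length

    // Stand-in point solver: ISTA steps beta <- S(beta - t * grad, t * lambda).
    def solve(lambda: Double, warm: Array[Double]): Array[Double] = {
      val beta = warm.clone()
      val step = 0.1
      for (_ <- 0 until 200) {
        val resid = Array.tabulate(n)(i =>
          y(i) - beta.indices.map(j => x(i)(j) * beta(j)).sum)
        for (j <- beta.indices) {
          val grad = -(0 until n).map(i => x(i)(j) * resid(i)).sum / n
          beta(j) = softThreshold(beta(j) - step * grad, step * lambda)
        }
      }
      beta
    }

    val path = coefficientPath(solve, lambdaMax(x, y, alpha = 1.0),
      numLambdas = 10, lambdaMinRatio = 0.01, numFeatures = 2)
    path.foreach { case (lam, b) =>
      println(f"lambda=$lam%.4f  beta=[${b.mkString(", ")}]")
    }
  }
}
{code}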

I'm just eager to have parallel versions of the tools available.  I'll keep you 
posted on our results.  We should aim for running one another's code.  I'll 
check with my colleagues and see when we'll have something we can hand out.  
We've delayed putting together a release version in favor of generating some 
scaling results, as Joseph suggested.  Discussions like this may have some 
impact on what the release code looks like. 
Mike

From: Debasish Das [mailto:debasish.da...@gmail.com]
Sent: Wednesday, February 25, 2015 08:50 AM
To: 'Joseph Bradley'
Cc: m...@mbowles.com, 'dev'
Subject: Re: Have Friedman's glmnet algo running in Spark

Any reason why the regularization path cannot be implemented using the current 
OWLQN PR?

We can change OWLQN in Breeze to fit your needs...

*From:* Joseph Bradley [mailto:jos...@databricks.com]
*Sent:* Sunday, February 22, 2015 06:48 PM
*To:* m...@mbowles.com
*Cc:* d...@spark.apache.org
*Subject:* Re: Have Friedman's glmnet algo running in Spark

Hi Mike, glmnet has definitely been very successful, and it would be great to 
see how we can improve optimization in MLlib! There is some related work 
ongoing; here are the JIRAs: 
- GLMNET implementation in Spark 
- LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package 
The GLMNET JIRA has actually been closed in favor of the latter JIRA.  
However, if you're getting good results in your experiments, could you please 
post them on the GLMNET JIRA and link them from the other JIRA? If it's faster 
and more scalable, that would be great to find out. As far as where the code 
should go and the APIs, that can be discussed on the JIRA. I hope this helps, 
and I'll keep an eye out for updates on the JIRAs! 
Joseph

> GLMNET implementation in Spark
> ------------------------------
>
>                 Key: SPARK-1673
>                 URL: https://issues.apache.org/jira/browse/SPARK-1673
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, 
> Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2 
> regularized linear models, including Linear/Logistic/Multinomial regressions.


