[ https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338897#comment-14338897 ]
mike bowles commented on SPARK-1673:
------------------------------------
Some colleagues and I have a Spark version of glmnet working and have started
some discussion. Joseph Bradley suggested that we copy the discussion here in
order to keep track of it. Here's the discussion thread in the usual
last-first order. Besides myself, the thread involves Joseph and Debasish Das
who is working on the OWLQN implementation.
On Wed, Feb 25, 2015 at 10:35 AM, <[email protected]> wrote:
Hi Debasish,
Any method that generates point solutions to the minimization problem could
simply be run a number of times to generate the coefficient paths as a function
of the penalty parameter. I think the only issues are how easy the method is
to use and how much training and developer time is required to produce an
answer.
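For concreteness, the single-point problem in question is the elastic-net
penalized least squares objective from Friedman's paper:

    \min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2
      + \lambda\Bigl(\alpha\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2\Bigr)

and the coefficient path is just the family of solutions \hat{\beta}(\lambda)
as \lambda sweeps a decreasing grid, each solve warm-started from the last.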
With regard to training time, Friedman says in his paper that they found
problems where glmnet would generate the entire coefficient path more rapidly
than sophisticated single point methods would generate single point solutions -
not all problems, but some problems. Ryan Tibshirani (Robert's son), a
professor at CMU who researches convex optimization, has echoed that assertion
for the particular case of the elastic-net penalty (that's from slides of his
that are available online). So there's an open question about training speed
that I believe we can answer in fairly short order.
I'm eager to explore that. Does OWLQN do a pass through the data for each
iteration? The linear version of GLMNET does not. On the other hand, OWLQN
may be able to take coarser steps through parameter space.
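(The way the linear version avoids a pass, as I read Friedman's paper, is the
covariance-updating trick: once the inner products \langle x_j, y\rangle and
\langle x_j, x_k\rangle for active features are cached, each coordinate step
can form the residual correlation it needs without touching the data again:

    \tfrac{1}{n}\langle x_j, r\rangle \;=\; \tfrac{1}{n}\langle x_j, y\rangle
      \;-\; \sum_{k:\,\beta_k \neq 0} \tfrac{1}{n}\langle x_j, x_k\rangle\,\beta_k

so a full pass is only needed when a new feature enters the active set.)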
With regard to developer time, glmnet doesn't require the user to supply a
starting point for the penalty parameter. It calculates the starting point.
That makes it completely automatic. You've probably been through the process
of manually searching regularization parameter space with an SVM: pick a set
of regularization values like 10^k for k from -2 through +5 in steps of 1, see
whether there's a minimum in the range, and if not shift the grid to the right
or left. One of the reasons I pick up glmnet first for a new problem is
that you just drop in the training set and out pop the coefficient curves.
Usually the defaults work. One time out of 50 (or so) it doesn't converge. It
alerts you that it didn't converge and you change one parameter and rerun. If
you also drop in a test set then it even picks the optimum solution and
produces an estimate of out-of-sample error.
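(The automatic starting point is the smallest penalty at which every
coefficient is zero, which Friedman's paper computes directly from the data:

    \lambda_{\max} \;=\; \max_{j}\; \frac{\lvert\langle x_j, y\rangle\rvert}{n\,\alpha}

The grid then descends from \lambda_{\max} to a small fraction
\epsilon\,\lambda_{\max} on a log scale, so no manual search is needed.)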
We're going to make some speed/scaling runs on the synthetic data sets (in a
range of sizes) that are used in Spark for testing linear regression. We need
some wider data sets. Joseph mentioned some that we'll look at. I've got a
gene expression data set that's 30k wide by 15k tall. That takes a few hours
to train using the R version of glmnet. We're also talking to some biology friends
to find other interesting data sets.
I really am eager to see the comparisons. And happy to help you tailor OWLQN
to generate coefficient paths. We might be able to produce a hybrid of
Friedman's algorithm using his basic algorithm outline but substituting OWLQN
for his round-robin coordinate descent. But I'm a little concerned that it's
the round-robin coordinate descent that makes it possible to skip passing
through the full data set for 4 out of 5 iterations. We might be able to work
a way around that.
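For reference, the inner step we'd be swapping out is Friedman's cyclic
coordinate update, which (for standardized predictors) soft-thresholds the
partial residual correlation one coefficient at a time:

    \beta_j \;\leftarrow\;
      \frac{S\bigl(\tfrac{1}{n}\langle x_j, r_{(j)}\rangle,\;\lambda\alpha\bigr)}
           {1 + \lambda(1-\alpha)},
    \qquad S(z,\gamma) = \operatorname{sign}(z)\max(\lvert z\rvert - \gamma,\,0)

where r_{(j)} is the residual with \beta_j's contribution added back. It's
this per-coordinate structure, together with the cached inner products, that
lets most sweeps skip the full data set.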
I'm just eager to have parallel versions of the tools available. I'll keep you
posted on our results. We should aim for running one another's code. I'll
check with my colleagues and see when we'll have something we can hand out.
We've delayed putting together a release version in favor of generating some
scaling results, as Joseph suggested. Discussions like this may have some
impact on what the release code looks like.
Mike
From: Debasish Das [mailto:[email protected]]
Sent: Wednesday, February 25, 2015 08:50 AM
To: 'Joseph Bradley'
Cc: [email protected], 'dev'
Subject: Re: Have Friedman's glmnet algo running in Spark
Any reason why the regularization path cannot be implemented using the current
OWLQN PR?
We can change OWLQN in breeze to fit your needs...
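To make the suggestion concrete, here's a minimal sketch of what that could
look like with breeze: warm-start OWLQN across a decreasing lambda grid, the
way glmnet traces its path. The least-squares DiffFunction and the fixed
iteration/tolerance settings here are illustrative assumptions, not the actual
PR code:

    import breeze.linalg.{DenseMatrix, DenseVector}
    import breeze.optimize.{DiffFunction, OWLQN}

    // Sketch: trace the L1 path by re-solving with OWLQN at each lambda,
    // warm-starting each solve from the previous solution.
    def owlqnPath(x: DenseMatrix[Double], y: DenseVector[Double],
                  lambdas: Seq[Double]): Seq[DenseVector[Double]] = {
      val n = x.rows.toDouble
      // Smooth part of the objective: (1 / 2n) * ||X beta - y||^2.
      val loss = new DiffFunction[DenseVector[Double]] {
        def calculate(beta: DenseVector[Double]): (Double, DenseVector[Double]) = {
          val r = x * beta - y
          ((r dot r) / (2.0 * n), (x.t * r) / n)
        }
      }
      var beta = DenseVector.zeros[Double](x.cols)
      lambdas.map { lambda =>
        // OWLQN handles the non-smooth lambda * ||beta||_1 term itself.
        val solver =
          new OWLQN[Int, DenseVector[Double]](100, 10, (_: Int) => lambda, 1e-6)
        beta = solver.minimize(loss, beta) // warm start
        beta.copy
      }
    }

The only piece missing relative to glmnet's automation is the grid itself,
which could be built from the lambda_max formula above.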
From: Joseph Bradley [mailto:[email protected]]
Sent: Sunday, February 22, 2015 06:48 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Have Friedman's glmnet algo running in Spark
Hi Mike, glmnet has definitely been very successful, and it would be great to
see how we can improve optimization in MLlib! There is some related work
ongoing; here are the JIRAs: "GLMNET implementation in Spark" and
"LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package". The
GLMNET JIRA has actually been closed in favor of the latter JIRA. However, if
you're getting good results in your experiments, could you please post them on
the GLMNET JIRA and link them from the other JIRA? If it's faster and more
scalable, that would be great to find out. As far as where the code should go
and the APIs, that can be discussed on the JIRA. I hope this helps, and I'll
keep an eye out for updates on the JIRAs! Joseph
> GLMNET implementation in Spark
> ------------------------------
>
> Key: SPARK-1673
> URL: https://issues.apache.org/jira/browse/SPARK-1673
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie,
> Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2
> regularized linear models, including Linear/Logistic/Multinomial regressions.