Hi Jason,

I came up with these ideas while working on the project. I do not use
external data to tune the penalty parameters; instead, I found a way to do
k-fold cross validation inside the training phase.
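
To make this concrete, here is a toy sketch (all names are my own
illustration, ridge penalty only, not the actual draft): X'X and X'y are
sufficient statistics for the fit, and they add up across disjoint chunks
of data, so one pass can accumulate them per fold; every (fold, lambda)
fit afterwards is a small in-memory solve.

    // Toy sketch: single-pass k-fold CV for ridge regression.
    // One pass accumulates per-fold X'X and X'y; each (fold, lambda)
    // fit is then a small p x p solve, with no second pass over data.
    public class SinglePassRidgeCv {

      static class FoldStats {
        final double[][] xtx;
        final double[] xty;

        FoldStats(int p) { xtx = new double[p][p]; xty = new double[p]; }

        void add(double[] x, double y) {
          for (int i = 0; i < x.length; i++) {
            xty[i] += x[i] * y;
            for (int j = 0; j < x.length; j++) xtx[i][j] += x[i] * x[j];
          }
        }
      }

      // The single pass: route each row to a fold, accumulate its stats.
      static FoldStats[] sweep(double[][] xs, double[] ys, int k) {
        FoldStats[] folds = new FoldStats[k];
        for (int f = 0; f < k; f++) folds[f] = new FoldStats(xs[0].length);
        for (int n = 0; n < xs.length; n++) folds[n % k].add(xs[n], ys[n]);
        return folds;
      }

      // Fit with fold `held` left out: sum the other folds' statistics
      // and solve (X'X + lambda*I) beta = X'y.
      static double[] fit(FoldStats[] folds, int held, double lambda) {
        int p = folds[0].xty.length;
        double[][] a = new double[p][p];
        double[] b = new double[p];
        for (int f = 0; f < folds.length; f++) {
          if (f == held) continue;
          for (int i = 0; i < p; i++) {
            b[i] += folds[f].xty[i];
            for (int j = 0; j < p; j++) a[i][j] += folds[f].xtx[i][j];
          }
        }
        for (int i = 0; i < p; i++) a[i][i] += lambda;
        return solve(a, b);
      }

      // Gaussian elimination with partial pivoting; fine for small p.
      static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int c = 0; c < n; c++) {
          int piv = c;
          for (int r = c + 1; r < n; r++) {
            if (Math.abs(a[r][c]) > Math.abs(a[piv][c])) piv = r;
          }
          double[] tr = a[c]; a[c] = a[piv]; a[piv] = tr;
          double tb = b[c]; b[c] = b[piv]; b[piv] = tb;
          for (int r = c + 1; r < n; r++) {
            double m = a[r][c] / a[c][c];
            for (int j = c; j < n; j++) a[r][j] -= m * a[c][j];
            b[r] -= m * b[c];
          }
        }
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
          double s = b[i];
          for (int j = i + 1; j < n; j++) s -= a[i][j] * x[j];
          x[i] = s / a[i][i];
        }
        return x;
      }
    }

For the Lasso and Elastic Net the per-fold solve becomes iterative
(covariance-update coordinate descent on the same statistics), but the
pass over the data is still single.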

For the dataset you mentioned, I would expect our algorithm to achieve
comparable performance.

So far I have not found any publications that implement these ideas. I
will post a link to my draft later.

We can work on this project together if you are interested!

Best,
-Kun


On Sun, Jun 30, 2013 at 1:43 PM, Jason Xin <[email protected]> wrote:

> Michael,
>
> Interesting stuff. SAS High Performance Analytics (HPA) has HPREG, which
> already has a PARTITION statement that implements penalty selection by
> way of introducing external data sets. Could you point to any publication
> on your idea? I think HPREG uses a mix of single-pass and iterative
> methods; I will find out more details. HPREG can currently finish a
> ~80 GB fit data set (66MM obs x 280 columns) on a 32-worker-node Hadoop
> cluster (1.5 TB) in 2 minutes with external data validation. Thanks.
>
> Jason Xin
>
> -----Original Message-----
> From: Michael Kun Yang [mailto:[email protected]]
> Sent: Sunday, June 30, 2013 3:36 PM
> To: Timothy Mann
> Cc: [email protected]
> Subject: Re: single-pass algorithm for penalized linear regression with
> cross validation
>
> Hi Timothy,
>
> Thank you for getting back to me!
>
> 1. I am willing to maintain the code; that is part of the work I should
> do as a contributor.
> 2. I have implemented a preliminary version at the start-up I am working
> for. The algorithm uses nothing but matrix computations. I have not yet
> coded the algorithm against Mahout's interfaces.
> 3. To accelerate the computation, I use an in-mapper combiner (a sketch
> follows this list). I tested it with MRUnit and confirmed the results
> against R packages.
> 4. I have been working on penalized linear regression for some time, and
> I think it will be easy to find applications for it.
>
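> To illustrate point 3, here is a toy version of the in-mapper combiner
> idea (class names and configuration keys are made up for the example,
> not the real code): partial X'X and X'y are accumulated in memory per
> fold, and one record per fold is written in cleanup(), instead of one
> record per input row.
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.IntWritable;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>     import org.apache.mahout.math.DenseVector;
>     import org.apache.mahout.math.VectorWritable;
>
>     // Emits, per fold, one flattened [X'X | X'y] vector per mapper.
>     public class SufficientStatsMapper
>         extends Mapper<LongWritable, Text, IntWritable, VectorWritable> {
>
>       private int k;          // number of CV folds
>       private int p;          // number of features
>       private double[][] acc; // per-fold accumulator, length p*p + p
>
>       @Override
>       protected void setup(Context ctx) {
>         k = ctx.getConfiguration().getInt("cv.folds", 10);
>         p = ctx.getConfiguration().getInt("cv.features", 0);
>         acc = new double[k][p * p + p];
>       }
>
>       @Override
>       protected void map(LongWritable offset, Text line, Context ctx) {
>         // Input rows are CSV: y,x1,...,xp.
>         String[] tok = line.toString().split(",");
>         double y = Double.parseDouble(tok[0]);
>         double[] x = new double[p];
>         for (int i = 0; i < p; i++) x[i] = Double.parseDouble(tok[i + 1]);
>         double[] a = acc[(int) (offset.get() % k)]; // toy fold choice
>         for (int i = 0; i < p; i++) {
>           a[p * p + i] += x[i] * y;
>           for (int j = 0; j < p; j++) a[i * p + j] += x[i] * x[j];
>         }
>       }
>
>       @Override
>       protected void cleanup(Context ctx)
>           throws IOException, InterruptedException {
>         // In-mapper combining: one write per fold, not one per row.
>         for (int f = 0; f < k; f++) {
>           ctx.write(new IntWritable(f),
>                     new VectorWritable(new DenseVector(acc[f])));
>         }
>       }
>     }
>
> The reducer then only has to sum a handful of dense vectors per fold,
> which keeps the shuffle small.
>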
> I hope this answers your questions.
>
> Best,
> -Michael
>
>
> On Sat, Jun 29, 2013 at 10:27 PM, Timothy Mann <[email protected]
> >wrote:
>
> > Hi Michael,
> >
> > Your approach sounds useful and (in my opinion) fills an important gap
> > in existing OSS machine learning libraries. I, for one, would be
> > interested in an efficient, parallel implementation of regularized
> > regression. I'm not a contributor to Mahout, but the usual questions
> > when someone wants to contribute an implemented algorithm seem to be:
> >
> > 1. Will you be willing and able (or know of someone who is willing and
> > able) to maintain the code once it is integrated with Mahout? (Mahout
> > developers currently seem to be stretched a bit thin)
> > 2. What is the state of the code? Is it already integrated with Mahout?
> > What libraries does it depend on? Does it conform (or can it be fit)
> > nicely to Mahout interfaces? How much work will it be (approximately)?
> > 3. How has your implementation been tested? Do you know of a dataset
> > that can be used for unit testing the framework? Is there a particular
> > use case that is driving your implementation and development of this
> > algorithm?
> >
> >
> > -King Tim
> >
> >
> > On Jun 30, 2013 1:15 AM, "Michael Kun Yang" <[email protected]>
> wrote:
> >
> >> Hello,
> >>
> >> I recently implemented a single-pass algorithm for penalized linear
> >> regression with cross validation at a big data start-up. I'd like to
> >> contribute this to Mahout.
> >>
> >> Penalized linear regression methods such as the Lasso and Elastic Net
> >> are widely used in machine learning, but there are no very efficient,
> >> scalable implementations of them on MapReduce.
> >>
> >> The published distributed algorithms for solving this problem are
> >> either iterative (which is not a good fit for MapReduce; see Stephen
> >> Boyd's paper) or approximate (a problem if we need exact solutions;
> >> see parallelized stochastic gradient descent). Another disadvantage
> >> of these algorithms is that they cannot do cross validation in the
> >> training phase: the user must provide a penalty parameter in advance.
> >>
> >> My ideas can train the model, with cross validation, in a single
> >> pass. They are based on some simple observations (sketched below). I
> >> will post them on arXiv and then share the link in a follow-up email.
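> >>
> >> To give a flavor of the kind of observation I mean (my shorthand
> >> here, not the full draft): in the ridge case the penalized fit
> >> depends on the data only through statistics that add up across
> >> folds,
> >>
> >>     G_g = \sum_{i \in g} x_i x_i^T,    c_g = \sum_{i \in g} x_i y_i,
> >>
> >>     \hat{\beta}_{-f}(\lambda)
> >>         = ( \sum_{g \ne f} G_g + \lambda I )^{-1} \sum_{g \ne f} c_g.
> >>
> >> One pass over the data computes every G_g and c_g; after that, each
> >> held-out fold f and each lambda on a grid costs only an in-memory
> >> solve, which is where the in-training cross validation comes from.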
> >>
> >> Any feedback would be helpful.
> >>
> >> Thanks
> >> -Michael
> >>
> >
>
>
