Hi Jason,

I came up with these ideas while I was working on the project. I did not use external data to tune the penalty parameters; instead, I found a way to do k-fold cross validation within the training phase.
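To make that concrete: for a ridge-type penalty, one standard way to get k-fold cross validation out of a single pass is to accumulate per-fold sufficient statistics (G_f = sum of x x', b_f = sum of x y, s_f = sum of y^2) and then compute every fold fit and its held-out error from those statistics alone, for any grid of penalty values. The single-machine sketch below only illustrates that general construction; the class name, the toy data, and the little solver are made up for the example, and our actual algorithm may differ in its details.

import java.util.Random;

/**
 * Sketch: k-fold cross validation for a ridge penalty in ONE pass over
 * the data. During the pass we only accumulate per-fold sufficient
 * statistics (G_f = sum x x', b_f = sum x y, s_f = sum y^2). Afterwards,
 * every "train on k-1 folds, validate on the held-out fold" fit and its
 * validation error are computed from those statistics alone, for any
 * grid of penalty values, without touching the data again.
 */
public class SinglePassRidgeCV {

  public static void main(String[] args) {
    int n = 5000, d = 5, k = 10;
    double[] trueW = {1.0, -2.0, 0.0, 0.5, 3.0};
    Random rnd = new Random(42);

    double[][][] g = new double[k][d][d]; // g[f] = sum over fold f of x x'
    double[][] b = new double[k][d];      // b[f] = sum over fold f of x y
    double[] s = new double[k];           // s[f] = sum over fold f of y^2

    // --- the single pass over the (here, synthetic) data ---
    for (int i = 0; i < n; i++) {
      double[] x = new double[d];
      double y = 0.1 * rnd.nextGaussian();
      for (int j = 0; j < d; j++) {
        x[j] = rnd.nextGaussian();
        y += trueW[j] * x[j];
      }
      int f = i % k; // fold assignment
      for (int p = 0; p < d; p++) {
        b[f][p] += x[p] * y;
        for (int q = 0; q < d; q++) g[f][p][q] += x[p] * x[q];
      }
      s[f] += y * y;
    }

    // --- model selection from the statistics only: no second pass ---
    double bestLambda = -1, bestErr = Double.MAX_VALUE;
    for (double lambda : new double[] {0.01, 0.1, 1, 10, 100}) {
      double cvErr = 0;
      for (int f = 0; f < k; f++) {
        double[][] a = new double[d][d]; // training Gram = totals minus fold f
        double[] c = new double[d];
        for (int t = 0; t < k; t++) {
          if (t == f) continue;
          for (int p = 0; p < d; p++) {
            c[p] += b[t][p];
            for (int q = 0; q < d; q++) a[p][q] += g[t][p][q];
          }
        }
        for (int p = 0; p < d; p++) a[p][p] += lambda; // ridge penalty
        double[] w = solve(a, c); // ridge fit on the k-1 training folds
        // Held-out error from statistics: sum (y - w'x)^2 over fold f
        //   = s_f - 2 w'b_f + w' G_f w.
        double err = s[f];
        for (int p = 0; p < d; p++) {
          err -= 2 * w[p] * b[f][p];
          for (int q = 0; q < d; q++) err += w[p] * g[f][p][q] * w[q];
        }
        cvErr += err;
      }
      if (cvErr < bestErr) { bestErr = cvErr; bestLambda = lambda; }
    }
    System.out.println("best lambda = " + bestLambda + ", CV SSE = " + bestErr);
  }

  /** Tiny Gaussian-elimination solver with partial pivoting (d is small). */
  static double[] solve(double[][] a0, double[] c0) {
    int d = c0.length;
    double[][] a = new double[d][];
    double[] c = c0.clone();
    for (int i = 0; i < d; i++) a[i] = a0[i].clone();
    for (int i = 0; i < d; i++) {
      int piv = i;
      for (int r = i + 1; r < d; r++) if (Math.abs(a[r][i]) > Math.abs(a[piv][i])) piv = r;
      double[] tr = a[i]; a[i] = a[piv]; a[piv] = tr;
      double tc = c[i]; c[i] = c[piv]; c[piv] = tc;
      for (int r = i + 1; r < d; r++) {
        double m = a[r][i] / a[i][i];
        for (int q = i; q < d; q++) a[r][q] -= m * a[i][q];
        c[r] -= m * c[i];
      }
    }
    double[] w = new double[d];
    for (int i = d - 1; i >= 0; i--) {
      double acc = c[i];
      for (int q = i + 1; q < d; q++) acc -= a[i][q] * w[q];
      w[i] = acc / a[i][i];
    }
    return w;
  }
}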
For the dataset you mentioned, I think our algorithm can achieve similar performance. So far I have not found any publications that implement these ideas. I will post a link to my draft later. We can work on this project together if you are interested!

Best,
-Kun

On Sun, Jun 30, 2013 at 1:43 PM, Jason Xin <[email protected]> wrote:

> Michael,
>
> Interesting stuff. SAS High Performance Analytics (HPA) has HPREG, which
> already has a PARTITION statement that implements penalties by way of
> introducing external data sets. Could you point to any publication on
> your idea? I think HPREG uses a mix of single-pass and iterative methods;
> I will find out more details. HPREG can currently finish a ~80GB fit data
> set (66MM obs x 280 columns) on a 32-worker-node Hadoop cluster (1.5TB)
> in 2 minutes with external data validation. Thanks.
>
> Jason Xin
>
> -----Original Message-----
> From: Michael Kun Yang [mailto:[email protected]]
> Sent: Sunday, June 30, 2013 3:36 PM
> To: Timothy Mann
> Cc: [email protected]
> Subject: Re: single-pass algorithm for penalized linear regression with
> cross validation
>
> Hi Timothy,
>
> Thank you for getting back!
>
> 1. I am willing to maintain the code; that is part of the work I should
> do as a contributor.
> 2. I have implemented a preliminary version at the start-up I am working
> for. The algorithm uses only matrix computations. I have not coded it
> against Mahout's interfaces yet.
> 3. To accelerate the computation, I use an in-mapper combiner (sketched
> after the quoted thread below). I tested it with MRUnit and confirmed
> the results against R packages.
> 4. I have been working on penalized linear regression for some time, and
> I think it will be easy to find applications.
>
> Hope this answers your questions.
>
> Best,
> -Michael
>
>
> On Sat, Jun 29, 2013 at 10:27 PM, Timothy Mann <[email protected]> wrote:
>
> > Hi Michael,
> >
> > Your approach sounds useful and (in my opinion) fills an important gap
> > in existing OSS machine learning libraries. I, for one, would be
> > interested in an efficient, parallel implementation of regularized
> > regression. I'm not a contributor to Mahout, but the usual questions
> > when someone wants to contribute an implemented algorithm seem to be:
> >
> > 1. Will you be willing and able (or know of someone who is willing and
> > able) to maintain the code once it is integrated with Mahout? (Mahout
> > developers currently seem to be stretched a bit thin.)
> > 2. What is the state of the code? Is it already integrated with Mahout?
> > What libraries does it depend on? Does it conform (or can it be fit)
> > nicely to Mahout interfaces? How much work will it be (approximately)?
> > 3. How has your implementation been tested? Do you know of a dataset
> > that can be used for unit testing the framework? Is there a particular
> > use case that is driving your implementation and development of this
> > algorithm?
> >
> > -King Tim
> >
> >
> > On Jun 30, 2013 1:15 AM, "Michael Kun Yang" <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I recently implemented a single-pass algorithm for penalized linear
> >> regression with cross validation at a big-data start-up, and I'd like
> >> to contribute it to Mahout.
> >>
> >> Penalized linear regressions such as the Lasso and the Elastic Net
> >> are widely used in machine learning, but there are no very efficient,
> >> scalable implementations of them on MapReduce.
> >>
> >> The published distributed algorithms for solving this problem are
> >> either iterative (which is a poor fit for MapReduce; see Stephen
> >> Boyd's paper) or approximate (what if we need exact solutions? see
> >> parallelized stochastic gradient descent). Another disadvantage of
> >> these algorithms is that they cannot do cross validation in the
> >> training phase; the user must provide a penalty parameter in advance.
> >>
> >> My ideas can train the model, with cross validation, in a single
> >> pass. They are based on some simple observations. I will post them on
> >> arXiv and then share the link in a follow-up email.
> >>
> >> Any feedback would be helpful.
> >>
> >> Thanks,
> >> -Michael
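P.S. On point 3 in the quoted reply above (the in-mapper combiner): the pattern is to hold the per-fold partial sums in the mapper's memory across all of its map() calls and emit them only once, from cleanup(), so the shuffle carries a few statistics records per mapper instead of one record per input row. Below is a rough sketch against the plain Hadoop mapper API; the class name, the CSV input format, and the fold assignment are illustrative only, not the code I will contribute.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * In-mapper combiner sketch for the statistics pass: accumulate the
 * per-fold sums (G_f, b_f, s_f) in memory across map() calls and emit
 * one record per fold from cleanup(). A reducer then only has to add a
 * handful of partial sums per fold.
 */
public class FoldStatisticsMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  private static final int K = 10;  // number of CV folds
  private int d;                    // number of features, set on first row
  private double[][][] g;           // g[f] = running sum of x x' for fold f
  private double[][] b;             // b[f] = running sum of x * y for fold f
  private double[] s;               // s[f] = running sum of y^2 for fold f

  @Override
  protected void map(LongWritable offset, Text line, Context ctx) {
    String[] cols = line.toString().split(",");   // y, x1, ..., xd
    if (g == null) {                              // lazy init on first row
      d = cols.length - 1;
      g = new double[K][d][d];
      b = new double[K][d];
      s = new double[K];
    }
    double y = Double.parseDouble(cols[0]);
    double[] x = new double[d];
    for (int j = 0; j < d; j++) x[j] = Double.parseDouble(cols[j + 1]);

    // Illustrative fold assignment; a real job would hash a stable row id.
    int f = (int) (offset.get() % K);
    for (int p = 0; p < d; p++) {
      b[f][p] += x[p] * y;
      for (int q = 0; q < d; q++) g[f][p][q] += x[p] * x[q];
    }
    s[f] += y * y;
    // Nothing is written per row -- that is the point of the pattern.
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    if (g == null) return;                        // empty input split
    for (int f = 0; f < K; f++) {
      // One record per fold per mapper; the reducer just sums these.
      StringBuilder sb = new StringBuilder().append(s[f]);
      for (int p = 0; p < d; p++) sb.append(',').append(b[f][p]);
      for (int p = 0; p < d; p++)
        for (int q = 0; q < d; q++) sb.append(',').append(g[f][p][q]);
      ctx.write(new IntWritable(f), new Text(sb.toString()));
    }
  }
}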
