I just sent the PR and fixed a typo in the comment. I added some comments and unit tests. Please let me know if you receive the patch.
On Mon, Jan 6, 2014 at 9:18 PM, Michael Kun Yang <kuny...@stanford.edu> wrote:

> I will follow up on the Newton one later.
>
> On Mon, Jan 6, 2014 at 9:14 PM, Michael Kun Yang <kuny...@stanford.edu> wrote:
>
>> I just sent the PR for multinomial logistic regression.
>>
>> On Mon, Jan 6, 2014 at 6:26 PM, Michael Kun Yang <kuny...@stanford.edu> wrote:
>>
>>> Thanks, will do.
>>>
>>> On Mon, Jan 6, 2014 at 6:21 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Thanks. Why don't you submit a PR and then we can work on it?
>>>>
>>>> On Jan 6, 2014, at 6:15 PM, Michael Kun Yang <kuny...@stanford.edu> wrote:
>>>>
>>>>> Hi Hossein,
>>>>>
>>>>> I can still use LabeledPoint with little modification. Currently I
>>>>> convert the category into a {0, 1} sequence, but I can do the conversion
>>>>> in the body of the methods or functions.
>>>>>
>>>>> In order to make the code run faster, I try not to use the DoubleMatrix
>>>>> abstraction, to avoid memory allocation; another reason is that jblas
>>>>> has no data structure to handle symmetric matrix addition efficiently.
>>>>>
>>>>> My code is not very pretty because I handle matrix operations manually
>>>>> (by indexing).
>>>>>
>>>>> If you think it is ok, I will make a pull request.
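The {0, 1} conversion and the manual, allocation-free indexing described above might look roughly like the following. This is a minimal sketch only, with hypothetical names, not the code from the PR: a class label is expanded into a {0, 1} indicator array, and the multinomial logistic regression gradient is accumulated into a flat row-major array by hand, with no per-example jblas DoubleMatrix allocation.

object MultinomialSketch {

  // {0, 1} encoding: label 2 with numClasses = 4 becomes [0, 0, 1, 0].
  def toIndicator(label: Int, numClasses: Int): Array[Double] = {
    val y = new Array[Double](numClasses)
    y(label) = 1.0
    y
  }

  // Add one example's gradient contribution in place.
  // weights and grad are flat, row-major (numClasses x numFeatures) arrays.
  def addGradient(features: Array[Double], label: Int,
                  weights: Array[Double], grad: Array[Double],
                  numClasses: Int): Unit = {
    val n = features.length
    // scores(k) = w_k . x, computed by manual indexing into the flat array
    val scores = Array.tabulate(numClasses) { k =>
      var s = 0.0
      var j = 0
      while (j < n) { s += weights(k * n + j) * features(j); j += 1 }
      s
    }
    // Softmax probabilities, shifted by the max score for numerical stability.
    val m = scores.max
    val exps = scores.map(s => math.exp(s - m))
    val z = exps.sum
    // grad_k += (p_k - y_k) * x, again indexing the flat array directly.
    val y = toIndicator(label, numClasses)
    var k = 0
    while (k < numClasses) {
      val delta = exps(k) / z - y(k)
      var j = 0
      while (j < n) { grad(k * n + j) += delta * features(j); j += 1 }
      k += 1
    }
  }
}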
>>>>> On Mon, Jan 6, 2014 at 5:34 PM, Hossein <fal...@gmail.com> wrote:
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> This sounds great. Would you please send these as a pull request?
>>>>>> Especially if you can make your Newton method implementation generic,
>>>>>> such that it can later be used by other algorithms, it would be very
>>>>>> helpful. For example, you could add it as another optimization method
>>>>>> under mllib/optimization.
>>>>>>
>>>>>> Was there a particular reason you chose not to use LabeledPoint?
>>>>>>
>>>>>> We have some instructions for contributions here:
>>>>>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --Hossein
>>>>>>
>>>>>> On Mon, Jan 6, 2014 at 11:33 AM, Michael Kun Yang <kuny...@stanford.edu> wrote:
>>>>>>
>>>>>>> I actually have two versions:
>>>>>>> one is based on gradient descent, like the logistic regression in
>>>>>>> MLlib; the other is based on Newton iteration. It is not as fast as
>>>>>>> SGD, but we can get all the statistics from it, like deviance,
>>>>>>> p-values and Fisher info.
>>>>>>>
>>>>>>> We can get a confusion matrix in both versions.
>>>>>>>
>>>>>>> The gradient descent version is just a modification of logistic
>>>>>>> regression with my own implementation. I did not use the LabeledPoint
>>>>>>> class.
>>>>>>>
>>>>>>> On Mon, Jan 6, 2014 at 11:13 AM, Evan Sparks <evan.spa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Michael,
>>>>>>>>
>>>>>>>> What strategy are you using to train the multinomial classifier?
>>>>>>>> One-vs-all? I've got an optimized version of that method that I've
>>>>>>>> been meaning to clean up and commit for a while. In particular,
>>>>>>>> rather than shipping a (potentially very big) model with each map
>>>>>>>> task, I ship it once before each iteration with a broadcast variable.
>>>>>>>> Perhaps we can compare versions and incorporate some of my
>>>>>>>> optimizations into your code?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Evan
>>>>>>>>
>>>>>>>> On Jan 6, 2014, at 10:57 AM, Michael Kun Yang <kuny...@stanford.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi Spark-ers,
>>>>>>>>>
>>>>>>>>> I implemented an SGD version of multinomial logistic regression
>>>>>>>>> based on MLlib's optimization package. If this classifier is in the
>>>>>>>>> future plan of MLlib, I will be happy to contribute my code.
>>>>>>>>>
>>>>>>>>> Cheers
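Evan's broadcast trick is not spelled out in code anywhere in the thread; below is a rough sketch of it, assuming a generic per-example gradient function. All the names here are hypothetical, not MLlib API. The point is that the (possibly very large) weight vector is shipped to the executors once per iteration via sc.broadcast, and each task reads it through .value, instead of the driver serializing a copy of the model into every task closure.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BroadcastPerIteration {
  // data: (label, features) pairs; gradient: per-example gradient of the loss.
  def train(sc: SparkContext,
            data: RDD[(Int, Array[Double])],
            init: Array[Double],
            gradient: (Array[Double], (Int, Array[Double])) => Array[Double],
            numIterations: Int,
            stepSize: Double): Array[Double] = {
    val weights = init.clone()
    for (iter <- 1 to numIterations) {
      // Ship the model once per iteration; broadcast a copy so the local
      // array can be updated safely afterwards.
      val bcWeights = sc.broadcast(weights.clone())
      val (gradSum, count) =
        data.aggregate((new Array[Double](weights.length), 0L))(
          (acc, point) => {
            // Tasks read the model via the broadcast, not a captured copy.
            val g = gradient(bcWeights.value, point)
            var j = 0
            while (j < acc._1.length) { acc._1(j) += g(j); j += 1 }
            (acc._1, acc._2 + 1)
          },
          (a, b) => {
            var j = 0
            while (j < a._1.length) { a._1(j) += b._1(j); j += 1 }
            (a._1, a._2 + b._2)
          })
      // Simple full-batch gradient step on the driver.
      var j = 0
      while (j < weights.length) {
        weights(j) -= stepSize * gradSum(j) / count; j += 1
      }
    }
    weights
  }
}

The payoff grows with model size: a multinomial model with K classes and n features has K * n weights, and without the broadcast that whole array would be re-serialized into every task closure on every iteration.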