On Fri, Oct 21, 2011 at 09:47, Andreas Mueller <[email protected]> wrote:
> Thanks for the quick answer, Alexandre.
>
>> On Fri, Oct 21, 2011 at 09:24, Andreas Mueller <[email protected]> wrote:
>>> Hi everybody.
>>> I have a question about the implementation of SGD. As far as I can tell,
>>> it follows Leon Bottou's work while using the learning rate from Pegasos.
>>> As far as I can tell, a difference between Bottou's SGD and Shwartz's
>>> Pegasos is the projection step in Pegasos that enforces the
>>> regularization constraints (if I understood correctly).
>>> The authors claim that this is an important part of their algorithm.
>> If I recall correctly, in their own code the projection step is almost
>> always commented out. The really important part of the algorithm is
>> the learning rate scaled by the strong convexity constant.
>>
>> When I implemented Pegasos I found that the projection step made
>> no difference at all, and hence also commented it out.
>>
> That seems pretty weird, given that they presented the projection
> as an integral part of their algorithm in the first paper.
> In the journal paper, though, they present it as a variation of the basic
> algorithm, which supports your experience.
>
> Did you do your experiments on text data? Or did you also try some
> dense datasets?
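[For reference, a minimal sketch of the Pegasos-style update under discussion:
the learning rate eta_t = 1/(lambda * t) scaled by the strong convexity
constant, with the optional projection step exposed as a flag so it can be
disabled, matching the observation above. All names are illustrative, not
scikit-learn API.]

```python
import numpy as np

def pegasos_sgd(X, y, lam=0.01, n_epochs=5, project=True, seed=0):
    """Minimal Pegasos-style SGD for a linear SVM (hinge loss, L2 reg).

    The learning rate eta_t = 1 / (lam * t) is the part described above
    as essential; the projection onto the ball of radius 1/sqrt(lam) is
    the step that reportedly makes little difference in practice.
    """
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    t = 1
    for _ in range(n_epochs):
        for i in rng.permutation(n_samples):
            eta = 1.0 / (lam * t)            # Pegasos learning rate
            margin = y[i] * np.dot(w, X[i])
            # Subgradient step: always shrink w (regularizer), add the
            # example only if it violates the margin.
            w *= (1.0 - eta * lam)
            if margin < 1.0:
                w += eta * y[i] * X[i]
            if project:
                # Optional projection onto {w : ||w|| <= 1/sqrt(lam)}.
                norm = np.linalg.norm(w)
                radius = 1.0 / np.sqrt(lam)
                if norm > radius:
                    w *= radius / norm
            t += 1
    return w
```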
I used text data, but as I didn't do any optimizations, the code should
be fine for dense data as well.

>>> What was the reason to favour the version of the algorithm without
>>> the projection step? Has anyone done any experiments comparing
>>> the different SGD approaches?
>>> I am trying to get into this a bit more and would love to understand
>>> the differences.
>>>
>>> On a related topic: Has anyone any experience in using SGD
>>> for kernelized SVMs? There is LaSVM by Bottou, and
>>> Pegasos can also do kernelized classification.
>>> Would it be worth including this in sklearn?
>> I've implemented this in the past, and kernelized Pegasos was always
>> far too slow to be usable, as predicting on a new data point involves
>> computing the kernel between this data point and every single other
>> point on which an update has ever happened. LaSVM is much faster
>> because it is very clever about keeping its support set small, and it
>> might be worth implementing. I should have inefficient pure-Python
>> code for it lying around somewhere.
>>
> That seems to be consistent with what they report. I haven't
> looked into LaSVM much, but it sounds interesting.
> Again, have you tried sparse or dense data? And do you
> think it makes a difference?
>
> Is LaSVM tricky to implement? I might give it to a student to do on CUDA ;)

It's not that tricky to implement, but it's also not very CUDA-friendly, as
it has far more ifs and whatnot than straight-up vector computations
(although you could conceivably use CUDA to compute the kernels).

--
- Alexandre
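[To make the cost argument above concrete: in a kernelized Pegasos the
decision function is a sum of kernel evaluations over every point that has
ever received a nonzero update, so prediction cost grows with the size of
that set. A minimal sketch with an RBF kernel; names are illustrative, not
any library's API.]

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

def kernel_decision(x_new, support_X, support_alpha, gamma=1.0):
    """Decision value f(x_new) = sum_i alpha_i * k(x_new, x_i).

    Each prediction costs O(len(support_X)) kernel evaluations, which is
    why an unbounded support set makes kernelized Pegasos slow at test
    time; LaSVM's advantage is keeping this set small.
    """
    return sum(a * rbf_kernel(x_new, x, gamma)
               for a, x in zip(support_alpha, support_X))
```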
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
