On Fri, Oct 21, 2011 at 09:47, Andreas Mueller <[email protected]> wrote:
> Thanks for the quick answer, Alexandre.
>
>
>> On Fri, Oct 21, 2011 at 09:24, Andreas Mueller<[email protected]>  
>> wrote:
>>> Hi everybody.
>>> I have a question about the implementation of SGD. As far as I can tell,
>>> it follows Leon Bottou's work while using the learning rate from Pegasos.
>>> One difference between Bottou's SGD and Shalev-Shwartz's
>>> Pegasos is the projection step in Pegasos that enforces the
>>> regularization constraints (if I understood correctly).
>>> The authors claim that this is an important part of their algorithm.
>> If I recall correctly in their own code the projection step is almost
>> always commented out. The really important part of the algorithm is
>> the learning rate scaled by the strong convexity constant.
>>
>> When I implemented Pegasos I found that the projection step made
>> no difference at all, and hence also commented it out.
>>
> That seems pretty weird, given that they presented the projection
> as an integral part of their algorithm in the first paper.
> In the journal paper, though, they present it as a variation of the basic
> algorithm, which supports your experience.
>
> Did you do your experiments on text data? Or did you also try some
> dense datasets?

I used text data, but as I didn't do any optimizations the code should
be ok for dense data as well.
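For reference, here is a rough sketch of the update we are talking about (an untested illustration, not the sklearn code; the `project` flag marks the projection step that, as said above, can usually be left out — the important part is the 1/(lambda*t) learning rate scaled by the strong convexity constant):

```python
import numpy as np

def pegasos_step(w, x, y, lam, t, project=False):
    """One Pegasos-style SGD step on the hinge-loss SVM objective
    lam/2 * ||w||^2 + max(0, 1 - y * <w, x>).

    The learning rate eta_t = 1 / (lam * t) is scaled by the strong
    convexity constant lam -- the part that actually matters.
    """
    eta = 1.0 / (lam * t)
    if y * np.dot(w, x) < 1:
        # point violates the margin: shrink + take a hinge-loss step
        w = (1.0 - eta * lam) * w + eta * y * x
    else:
        # only the regularization shrinkage
        w = (1.0 - eta * lam) * w
    if project:
        # optional projection onto the ball of radius 1/sqrt(lam);
        # in practice this step makes little difference
        norm = np.linalg.norm(w)
        radius = 1.0 / np.sqrt(lam)
        if norm > radius:
            w = w * (radius / norm)
    return w
```

With `project=True` the iterate stays inside the ball of radius 1/sqrt(lam); without it, the shrinkage factor (1 - eta*lam) keeps the norm under control anyway, which is presumably why commenting it out changes nothing in practice.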

>>> What was the reason to favour the version of the algorithm without
>>> the projection step? Has anyone done any experiments on comparing
>>> the different SGD approaches?
>>> I am trying to get into this a bit more and would love to understand
>>> the differences.
>>>
>>> On a related topic: does anyone have experience using SGD
>>> for kernelized SVMs? There is LaSVM by Bottou, and
>>> Pegasos can also do kernelized classification.
>>> Would it be worth including this in sklearn?
>> I've implemented this in the past, and kernelized Pegasos was always
>> far too slow to be usable, as predicting on a new data point involves
>> computing the kernel between this data point and every single other
>> point on which an update has ever happened. LaSVM is much faster
>> because it is very clever about keeping its support set small, and it
>> might be worth implementing. I should have inefficient pure-Python
>> code for it lying around somewhere.
>>
>>
> That seems to be consistent with what they report. I haven't
> looked into LaSVM much but it sounds interesting.
> Again, have you tried sparse or dense data? And do you
> think it makes a difference?
>
> Is LaSVM tricky to implement? I might give it to a Student to do on cuda ;)
>

It's not that tricky to implement but it's also not very
CUDA-friendly, as it has far more ifs and whatnot than straight-up
vector computations (although you could conceivably use CUDA to
compute the kernels).
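To make the slowness concrete, here is a bare-bones sketch of kernelized Pegasos (an illustration from memory, not my old code; the RBF kernel and function names are made up for the example). Every point that ever triggers an update keeps a nonzero coefficient, so both training and prediction have to sum kernels over that ever-growing set:

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    # illustrative RBF kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_pegasos_train(X, Y, lam, n_iter, kernel=rbf, seed=0):
    """Kernelized Pegasos sketch: alpha[j] counts how often point j
    triggered an update. Nothing is ever removed from the support set,
    which is what makes it slow compared to LaSVM."""
    rng = np.random.RandomState(seed)
    n = len(X)
    alpha = np.zeros(n)
    for t in range(1, n_iter + 1):
        i = rng.randint(n)
        # decision value: kernel against every previously updated point
        s = sum(alpha[j] * Y[j] * kernel(X[i], X[j])
                for j in range(n) if alpha[j] > 0)
        if Y[i] * s / (lam * t) < 1:
            alpha[i] += 1
    return alpha

def kernel_pegasos_predict(x, X, Y, alpha, kernel=rbf):
    # O(size of support set) kernel evaluations per prediction;
    # the scaling by lam*t does not change the sign, so it is dropped
    s = sum(alpha[j] * Y[j] * kernel(x, X[j])
            for j in range(len(X)) if alpha[j] > 0)
    return np.sign(s)
```

LaSVM avoids exactly this blow-up by also removing points from the support set, but that bookkeeping is the branchy, non-vectorizable part that makes it awkward on CUDA.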

-- 
 - Alexandre

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
