On 09/11/2011 14:32, Mathieu Blondel wrote:
> On Wed, Nov 9, 2011 at 4:37 AM, Peter Prettenhofer
> <[email protected]> wrote:
>
>> I'm aware of the issue - it seems to me that Bob is right, but I can
>> hardly tell based on empirical evidence. Truncated gradient is quite a
>> crude procedure anyways - Olivier once suggested using a projected
>> gradient approach instead (see J. Duchi's work). Anyways, the current
>> approach does a reasonable job of obtaining a sparse solution with high
>> predictive performance (I compared mostly with liblinear, though).
>
> For me, Duchi's L1 ball projection method is only useful if your data
> is dense, because their basic algorithm has a linear dependency on the
> number of features (zero or not). They also developed an algorithm
> using red-black trees with a logarithmic dependency on the number of
> features, but it looks really complicated. In comparison, the
> unconstrained formulations like Carpenter's and Langford's only depend
> on the number of *non-zero* features, thanks to the lazy-update trick.

This might be a bit off-topic, but I am not sure the linear complexity
of Duchi's algorithm (simple version) is such a big issue in practice
(and practicality beats purity, right ;-)).
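In case it helps, here is roughly what the simple version boils down
to - a quick NumPy sketch in the spirit of Duchi et al.'s sorting-based
algorithm (function and variable names are mine; the gist below is the
version I actually tested):

import numpy as np

def project_l1_ball(v, z=1.0):
    # Euclidean projection of v onto {w : ||w||_1 <= z}, z > 0.
    # Simple version: sort the magnitudes, take cumulative sums,
    # find the threshold theta, then soft-threshold.
    u = np.abs(v)
    if u.sum() <= z:              # already inside the ball
        return v.copy()
    s = np.sort(u)[::-1]          # magnitudes in decreasing order
    css = np.cumsum(s)
    # largest index rho with s[rho] > (css[rho] - z) / (rho + 1)
    rho = np.nonzero(s * np.arange(1, len(s) + 1) > css - z)[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(u - theta, 0.0)

For example, projecting [3., 1.] onto the ball of radius 2 gives
[2., 0.], as expected from soft-thresholding with theta = 1.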
For my own needs (projected gradient descent), I quickly implemented it
here: https://gist.github.com/1272551 (I tested it against Duchi's own
Matlab code). Despite its linear complexity, it has the nice perk of
being super simple, and it scales easily to dense vectors with millions
of dimensions. The bottleneck is probably computing the cumulative sums
of the sorted vector, which is already quite fast. It could be made even
faster with an ad hoc implementation for sparse vectors if necessary,
but my guess is that you would need really huge and very sparse vectors
for it to be worth your while.

My 1.99 cents (not really worth 2).

Adrien

> D. Sculley proposed a simpler algorithm based on a binary search in
> his web-scale k-means paper. The linear-time algorithm of Duchi is
> more complicated. Again, it makes more sense if your data is dense
> (which is the case for cluster centers).
>
>> BTW: presentation, code and errata of the paper can be found on the
>> author's website:
>> http://www.logos.ic.i.u-tokyo.ac.jp/~tsuruoka/
>
> Good to know, I'll have a look later.
>
> Mathieu
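PS: the binary-search projection that Mathieu attributes to D. Sculley
above also fits in a few lines. A rough sketch of the idea (the
tolerance and iteration cap are my own choices, not from the paper):
the L1 norm of the soft-thresholded vector decreases monotonically in
the threshold theta, so you can bisect theta until the norm reaches the
radius:

import numpy as np

def project_l1_ball_bisect(v, z=1.0, tol=1e-6, max_iter=100):
    # f(theta) = ||max(|v| - theta, 0)||_1 decreases monotonically
    # from ||v||_1 (at theta = 0) to 0 (at theta = max|v|),
    # so bisect theta until f(theta) is within tol of z.
    u = np.abs(v)
    if u.sum() <= z:              # already inside the ball
        return v.copy()
    lo, hi = 0.0, u.max()
    theta = 0.5 * (lo + hi)
    for _ in range(max_iter):
        theta = 0.5 * (lo + hi)
        norm = np.maximum(u - theta, 0.0).sum()
        if abs(norm - z) <= tol:
            break
        if norm > z:
            lo = theta            # projection still too large: raise theta
        else:
            hi = theta            # shrunk too much: lower theta
    return np.sign(v) * np.maximum(u - theta, 0.0)

Each iteration is O(n) with no sort, at the price of an approximate
threshold - which seems fine for dense cluster centers.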
