2011/11/8 Mathieu Blondel <[email protected]>:
> Hello,
>
> I was re-reading Tsuruoka's paper, on which SGDClassifier's L1
> regularization is based, and found this interesting post (as usual?)
> by Bob Carpenter:
>
> http://lingpipe-blog.com/2009/09/18/tsuruoka-tsujii-ananiadou-2009-stochastic-gradient-descent-training-for-l1-regularized-log-linear-models-with-cumulative-penalty/
>
> He mentions a possible bug in the paper which if confirmed also
> affects SGDClassifier in scikit-learn: the lazy updates which are
> buffered towards the end of the training may well never be applied, so
> it's necessary to apply the L1 regularization routine one more time
> after the for loop to be sure that all buffered updates have been
> applied. This means that the solutions found by the current
> SGDClassifier may not be as sparse as they could be.

Hi Mathieu,

I'm aware of the issue - it seems to me that Bob is right, but it's hard
to confirm based on empirical evidence alone. Truncated gradient is quite
a crude procedure anyway - Olivier once suggested using a projected
gradient approach instead (see J. Duchi's work). Still, the current
approach does a reasonable job of obtaining a sparse solution with high
predictive performance (I compared mostly against liblinear, though).
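For concreteness, here's a minimal sketch of the cumulative penalty
update (Figure 2 in the paper) with the extra flush that Bob suggests
after the loop - illustrative only, with made-up toy data and a
squared-loss stand-in, not SGDClassifier's actual code:

```python
import numpy as np

def apply_penalty(w, q, u, idx):
    """Cumulative L1 penalty step for the features in idx.

    w : weights, q[i] : penalty actually applied to w[i] so far,
    u : total penalty each weight could have received so far.
    """
    for i in idx:
        z = w[i]
        if w[i] > 0:
            w[i] = max(0.0, w[i] - (u + q[i]))
        elif w[i] < 0:
            w[i] = min(0.0, w[i] + (u - q[i]))
        q[i] += w[i] - z  # remember how much was actually taken

rng = np.random.RandomState(0)
n_samples, n_features = 50, 10
# sparse-ish toy data so some features stay inactive late in the pass
X = rng.randn(n_samples, n_features) * (rng.rand(n_samples, n_features) < 0.3)
y = rng.choice([-1.0, 1.0], size=n_samples)

w = np.zeros(n_features)
q = np.zeros(n_features)
eta, lam, u = 0.1, 0.01, 0.0

for t in range(n_samples):
    x = X[t]
    active = np.nonzero(x)[0]
    # plain squared-loss gradient step as a stand-in for the log-linear loss
    w[active] -= eta * (np.dot(w, x) - y[t]) * x[active]
    u += eta * lam
    apply_penalty(w, q, u, active)  # lazy: only touch active features

# Bob's point: features inactive near the end of training still "owe"
# penalty, so flush the buffered updates once more after the loop
apply_penalty(w, q, u, range(n_features))
```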

BTW: the presentation, code, and errata for the paper can be found on the
authors' website:
 http://www.logos.ic.i.u-tokyo.ac.jp/~tsuruoka/

>
> As an aside, I'm not sure I fully grasp the difference between what
> the authors call "SGD-L1 (Clipped + Lazy-Update)" and the proposed
> cumulative penalty method. In the former case, since the weights won't
> take part in the inner product until the corresponding features are
> actually activated in an instance, we can just buffer the updates
> until we actually need to make them. In the cumulative penalty case,
> they basically do the same, i.e., they accumulate the penalties. The
> only potential difference I see is that they accumulate the penalties
> even if the weight is clipped to 0 (line 21 in Figure 2). Their method
> obtains a much sparser solution than just "SGD-L1 (Clipped +
> Lazy-Update)", so I guess I'm missing something. Any explanation would
> be greatly appreciated.

Unfortunately, I'm not that familiar with "SGD-L1 (Clipped +
Lazy-Update)" either - I just quickly skimmed a technical report by Bob
[1]. I agree with your description: it seems to me that the major
difference is that the cumulative penalty approach "remembers" the
amount that has been clipped and applies it the next time the feature
becomes active, whereas the "SGD-L1 (Clipped + Lazy-Update)" approach
discards the penalty that has been clipped.

[1] http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
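To make the distinction concrete, a toy numeric example (made-up
numbers, not from the paper): a weight gets penalized, clips to zero,
and a gradient step then pushes it back up. The clipped variant forgets
the excess penalty; the cumulative variant carries it forward, which is
presumably where the extra sparsity comes from:

```python
def clipped_step(w_i, penalty):
    # truncate at zero; any penalty beyond |w_i| is simply discarded
    if w_i > 0:
        return max(0.0, w_i - penalty)
    if w_i < 0:
        return min(0.0, w_i + penalty)
    return w_i

def cumulative_step(w_i, q_i, u):
    # u: total penalty the weight could have received so far,
    # q_i: penalty actually applied so far -- the difference is "owed"
    z = w_i
    if w_i > 0:
        w_i = max(0.0, w_i - (u + q_i))
    elif w_i < 0:
        w_i = min(0.0, w_i + (u - q_i))
    return w_i, q_i + (w_i - z)

# round 1: weight 0.1, per-round penalty 0.3 -- both variants clip to 0
w_c = clipped_step(0.1, 0.3)             # 0.0, excess 0.2 forgotten
w_k, q = cumulative_step(0.1, 0.0, 0.3)  # 0.0, q = -0.1 (only 0.1 taken)

# a gradient step pushes the weight back up to 0.35
w_c, w_k = 0.35, 0.35

# round 2: clipped applies only 0.3; cumulative applies the 0.5 still owed
w_c = clipped_step(w_c, 0.3)             # ~0.05 -- survives
w_k, q = cumulative_step(w_k, q, 0.6)    # 0.0  -- clipped again
print(w_c, w_k)
```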

best,
 Peter

>
> Link to the paper:
> http://www.aclweb.org/anthology-new/P/P09/P09-1054.pdf
>
> Mathieu
>
> ------------------------------------------------------------------------------
> RSA(R) Conference 2012
> Save $700 by Nov 18
> Register now
> http://p.sf.net/sfu/rsa-sfdev2dev1
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>



-- 
Peter Prettenhofer

