2011/11/8 Mathieu Blondel <[email protected]>:
> Hello,
>
> I was re-reading Tsuruoka's paper, on which the L1 regularization in
> SGDClassifier is based, and found this interesting post (as usual?)
> by Bob Carpenter:
>
> http://lingpipe-blog.com/2009/09/18/tsuruoka-tsujii-ananiadou-2009-stochastic-gradient-descent-training-for-l1-regularized-log-linear-models-with-cumulative-penalty/
>
> He mentions a possible bug in the paper which, if confirmed, also
> affects SGDClassifier in scikit-learn: the lazy updates that are still
> buffered towards the end of training may well never be applied, so it
> is necessary to apply the L1 regularization routine one more time
> after the for loop to be sure that all buffered updates have been
> applied. This means that the solutions found by the current
> SGDClassifier may not be as sparse as they could be.
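For concreteness, the fix Bob suggests amounts to one extra application
of the penalty routine after the training loop, so that any penalty
still buffered for rarely active features gets applied. Below is a rough
NumPy sketch of the cumulative-penalty algorithm (Figure 2 of the paper)
with that final pass; the function names and learning-rate schedule are
made up for illustration and are not the actual Cython code behind
SGDClassifier.

import numpy as np


def apply_penalty(w, q, u, j):
    # Fig. 2 of the paper: u is the total absolute L1 penalty every weight
    # could have received so far, q[j] the (signed) penalty actually applied
    # to w[j]; apply whatever is still pending, clipping at zero.
    z = w[j]
    if w[j] > 0.0:
        w[j] = max(0.0, w[j] - (u + q[j]))
    elif w[j] < 0.0:
        w[j] = min(0.0, w[j] + (u - q[j]))
    q[j] += w[j] - z


def fit_l1_logreg(X, y, C=1.0, eta0=0.1, n_epochs=5):
    # Toy SGD for L1-regularized logistic regression with cumulative penalty.
    # X is a dense (n_samples, n_features) array, y takes values in {-1, +1}.
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    q = np.zeros(n_features)
    u, t = 0.0, 0
    for epoch in range(n_epochs):
        for i in np.random.permutation(n_samples):
            t += 1
            eta = eta0 / (1.0 + 0.01 * t)        # some decaying learning rate
            u += eta * C / n_samples             # total penalty owed so far
            margin = y[i] * np.dot(w, X[i])
            dloss = -y[i] / (1.0 + np.exp(margin))
            for j in np.nonzero(X[i])[0]:        # only the active features
                w[j] -= eta * dloss * X[i, j]    # plain gradient step
                apply_penalty(w, q, u, j)        # lazily settle the L1 penalty
    # Bob's point: features that were inactive near the end of training still
    # owe part of the penalty (u vs. q[j]); without this final pass the
    # solution ends up less sparse than it should be.
    for j in range(n_features):
        apply_penalty(w, q, u, j)
    return w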
Hi Mathieu,

I'm aware of the issue - it seems to me that Bob is right, but it is
hard to tell based on empirical evidence alone. Truncated gradient is
quite a crude procedure anyway - Olivier once suggested using a
projected gradient approach instead (see J. Duchi's work). Still, the
current approach does a reasonable job of obtaining a sparse solution
with high predictive performance (I compared mostly with liblinear,
though).

BTW: presentation, code and errata of the paper can be found on the
authors' website: http://www.logos.ic.i.u-tokyo.ac.jp/~tsuruoka/

> As an aside, I'm not sure I fully grasp the difference between what
> the authors call "SGD-L1 (Clipped + Lazy-Update)" and the proposed
> cumulative penalty method. In the former case, since the weights won't
> take part in the inner product until the corresponding features are
> actually activated in an instance, we can just buffer the updates
> until we actually need to make them. In the cumulative penalty case,
> they basically do the same, i.e., they accumulate the penalties. The
> only potential difference I see is that they accumulate the penalties
> even if the weight is clipped to 0 (line 21 in Figure 2). Their method
> obtains a much sparser solution than just "SGD-L1 (Clipped +
> Lazy-Update)", so I guess I'm missing something. Any explanation would
> be greatly appreciated.

Unfortunately, I'm not that familiar with "SGD-L1 (Clipped +
Lazy-Update)" either - I just quickly skimmed over a technical report
of Bob's [1]. I agree with your description: it seems to me that the
major difference is that the cumulative penalty approach "remembers"
the amount that has been clipped and applies it the next time the
feature becomes active, whereas the "SGD-L1 (Clipped + Lazy-Update)"
approach discards the penalty that has been clipped.

[1] http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf

best,
Peter

> Link to the paper:
> http://www.aclweb.org/anthology-new/P/P09/P09-1054.pdf
>
> Mathieu

--
Peter Prettenhofer
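To make the "remembers vs. discards" distinction above concrete, here is
a toy per-weight comparison of the two penalty rules as Peter describes
them, again in the paper's u/q notation; the function names are
illustrative and this is not the code used in LingPipe or scikit-learn.

def clipped_penalty(w_j, eta_lambda):
    # "SGD-L1 (Clipped)": push w_j towards zero by at most eta * lambda per
    # update and simply forget whatever part of the penalty would cross zero.
    if w_j > 0.0:
        return max(0.0, w_j - eta_lambda)
    if w_j < 0.0:
        return min(0.0, w_j + eta_lambda)
    return w_j


def cumulative_penalty(w_j, q_j, u):
    # Cumulative penalty: u is the total penalty w_j should have received by
    # now and q_j the part actually applied; the shortfall is carried over
    # rather than discarded, so it can still be collected later.
    z = w_j
    if w_j > 0.0:
        w_j = max(0.0, w_j - (u + q_j))
    elif w_j < 0.0:
        w_j = min(0.0, w_j + (u - q_j))
    return w_j, q_j + (w_j - z)

Once a weight has been clipped to zero, the clipped variant starts from
a clean slate at the next update, whereas in the cumulative variant u
keeps growing while q_j stays put, so a later gradient step has to
outweigh the whole accumulated backlog before the weight can move away
from zero again - which is presumably why the cumulative method ends up
so much sparser.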
