I do confirm that Lasso and LassoLars both minimize

1/(2n) * ||y - Xw||^2_2 + alpha * ||w||_1

and that the n should not be present in the sparse coding context.
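
To make this concrete, here is a rough NumPy sketch of that objective (the
function name is mine, it does not exist in scikit-learn):

import numpy as np

def lasso_objective(X, y, w, alpha):
    # (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
    n_samples = X.shape[0]
    residual = y - np.dot(X, w)
    return np.dot(residual, residual) / (2. * n_samples) + alpha * np.abs(w).sum()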

This means that:

http://scikit-learn.org/stable/modules/linear_model.html#lasso

is not correct. I don't know if this also affects the SGD documentation.
I would also vote for writing the cost function minimized in the Lasso
(etc.) docstrings.

Regarding the shapes in sparse_encode, I'll let Vlad comment.

Alex

On Tue, Dec 6, 2011 at 10:27 PM, David Warde-Farley
<[email protected]> wrote:
> On Tue, Dec 06, 2011 at 08:43:06PM +0100, Olivier Grisel wrote:
>> 2011/12/6 David Warde-Farley <[email protected]>:
>> > On Tue, Dec 06, 2011 at 09:04:22AM +0100, Alexandre Gramfort wrote:
>> >> > This actually gets at something I've been meaning to fiddle with and 
>> >> > report but haven't had time: I'm not sure I completely trust the 
>> >> > coordinate descent implementation in scikit-learn, because it seems to 
>> >> > give me bogus answers a lot (i.e., the optimality conditions necessary 
>> >> > for it to be an actual solution are not even approximately satisfied). 
>> >> > Are you guys using something weird for the termination condition?
>> >>
>> >> can you give us a sample X and y that shows the problem?
>> >>
>> >> it should ultimately use the duality gap to stop the iterations but
>> >> there might be a corner case …
>> >
>> > In [34]: rng = np.random.RandomState(0)
>> >
>> > In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /= np.sqrt((dictionary ** 2).sum(axis=0))
>> >
>> > In [36]: signal = rng.normal(size=100) / 1000
>> >
>> > In [37]: from sklearn.linear_model import Lasso
>> >
>> > In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False, tol=1e-8)
>> >
>> > In [39]: lasso.fit(dictionary, signal)
>> > Out[39]:
>> > Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
>> >   normalize=False, precompute='auto', tol=1e-08)
>> >
>> > In [40]: max(abs(lasso.coef_))
>> > Out[40]: 0.0
>> >
>> > In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
>> >
>> > In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
>> >
>> > In [43]: max(abs(coef))
>> > Out[43]: 0.0027295761244725018
>> >
>> > And I'm pretty sure the latter result is the right one, since
>> >
>> > In [45]: def gradient(coefs):
>> >   ....:     gram = np.dot(dictionary.T, dictionary)
>> >   ....:     corr = np.dot(dictionary.T, signal)
>> >   ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 * np.sign(coefs)
>> >   ....:
>>
>> Actually, alpha in scikit-learn is multiplied by n_samples. I agree
>> this is misleading and not documented in the docstring.
>>
>> >>> lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
>> >>> max(abs(lasso.coef_))
>> 0.0027627270397484554
>> >>> max(abs(gradient(lasso.coef_)))
>> 0.00019687294269977963
>
> Seems like there's an added factor of 2 in there as well,
> though this is a little more standard:
>
> In [94]: lasso = Lasso(alpha=0.0001 / (2 * dictionary.shape[0]), max_iter=1e8, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
>
> In [95]: coef = feature_sign_search(dictionary, signal, 0.0001)
> In [96]: allclose(lasso.coef_, coef, atol=1e-7)
> Out[96]: True
>
> I think you're right that the precise cost function definitely ought to be
> documented in the front-facing classes rather than just the low-level Cython
> routines.
>
> I also think that scaling the way Lasso/ElasticNet does in the context of
> sparse coding may be very confusing, since in sparse coding it corresponds
> not to a number of training samples in a regression problem but to the number
> of input dimensions.
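>
> Concretely, to match an unscaled sparse coding penalty gamma on
> ||y - Dw||_2^2 + gamma * ||w||_1, one has to do something like this
> (a sketch reusing the session above; "gamma" is just my name for that penalty):
>
> gamma = 0.0001
> n_dims = dictionary.shape[0]   # 100 input dimensions here, not training samples
> lasso = Lasso(alpha=gamma / (2 * n_dims), max_iter=1e8, fit_intercept=False, tol=1e-8).fit(dictionary, signal)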
>
> The docstring of sparse_encode is quite confusing in that the shape of X, the
> dictionary, is given as "n_samples, n_components". The number of samples (in the context of
> sparse coding) should have no influence over the shape of the dictionary;
> this seems to have leaked over from the Lasso documentation.
>
> The shape and mathematical definition of cov don't make much sense to me
> given this change, though (or to begin with, for that matter): in the case of
> a single problem, the desired covariance is X^T y, with y a column vector,
> yielding another column vector of (n_components, 1). So the shape, if you
> have multiple examples you're precomputing for, should end up being
> (n_components, n_samples), and given the shape of Y that would be achieved by
> X^T Y^T.
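>
> A quick shape check of what I mean (a sketch, assuming Y holds one signal per row):
>
> D = rng.normal(size=(100, 500))   # dictionary: (n_dims, n_components)
> Y = rng.normal(size=(20, 100))    # 20 signals, each of dimension 100
> cov = np.dot(D.T, Y.T)            # -> shape (n_components, n_samples) == (500, 20)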
>
> David
>
