I confirm that Lasso and LassoLars both minimize

    1/(2 * n_samples) * ||y - Xw||_2^2 + alpha * ||w||_1

and that the n should not be present in the sparse coding context. This means
that http://scikit-learn.org/stable/modules/linear_model.html#lasso is not
correct. I don't know whether this also affects the SGD documentation.

I would also vote for writing out the cost function minimized by Lasso (and
friends) in the docstrings; a quick numerical check of the rescaling is
sketched at the end of this message.

Regarding the shapes used by sparse_encode, I'll let Vlad comment.

Alex

On Tue, Dec 6, 2011 at 10:27 PM, David Warde-Farley
<[email protected]> wrote:
> On Tue, Dec 06, 2011 at 08:43:06PM +0100, Olivier Grisel wrote:
>> 2011/12/6 David Warde-Farley <[email protected]>:
>> > On Tue, Dec 06, 2011 at 09:04:22AM +0100, Alexandre Gramfort wrote:
>> >> > This actually gets at something I've been meaning to fiddle with and
>> >> > report but haven't had time: I'm not sure I completely trust the
>> >> > coordinate descent implementation in scikit-learn, because it seems to
>> >> > give me bogus answers a lot (i.e., the optimality conditions necessary
>> >> > for it to be an actual solution are not even approximately satisfied).
>> >> > Are you guys using something weird for the termination condition?
>> >>
>> >> can you give us a sample X and y that shows the pb?
>> >>
>> >> it should ultimately use the duality gap to stop the iterations but
>> >> there might be a corner case …
>> >
>> > In [34]: rng = np.random.RandomState(0)
>> >
>> > In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /= np.sqrt((dictionary ** 2).sum(axis=0))
>> >
>> > In [36]: signal = rng.normal(size=100) / 1000
>> >
>> > In [37]: from sklearn.linear_model import Lasso
>> >
>> > In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False, tol=1e-8)
>> >
>> > In [39]: lasso.fit(dictionary, signal)
>> > Out[39]:
>> > Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
>> >       normalize=False, precompute='auto', tol=1e-08)
>> >
>> > In [40]: max(abs(lasso.coef_))
>> > Out[40]: 0.0
>> >
>> > In [41]: from pylearn2.optimization.feature_sign import feature_sign_search
>> >
>> > In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)
>> >
>> > In [43]: max(abs(coef))
>> > Out[43]: 0.0027295761244725018
>> >
>> > And I'm pretty sure the latter result is the right one, since
>> >
>> > In [45]: def gradient(coefs):
>> >    ....:     gram = np.dot(dictionary.T, dictionary)
>> >    ....:     corr = np.dot(dictionary.T, signal)
>> >    ....:     return - 2 * corr + 2 * np.dot(gram, coefs) + 0.0001 * np.sign(coefs)
>> >    ....:
>>
>> Actually, alpha in scikit-learn is multiplied by n_samples. I agree
>> this is misleading and not documented in the docstring.
>>
>> >>> lasso = Lasso(alpha=0.0001 / dictionary.shape[0], max_iter=1e6, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
>> >>> max(abs(lasso.coef_))
>> 0.0027627270397484554
>> >>> max(abs(gradient(lasso.coef_)))
>> 0.00019687294269977963
>
> Seems like there's an added factor of 2 in there as well,
> though this is a little more standard:
>
> In [94]: lasso = Lasso(alpha=0.0001 / (2 * dictionary.shape[0]), max_iter=1e8, fit_intercept=False, tol=1e-8).fit(dictionary, signal)
>
> In [95]: coef = feature_sign_search(dictionary, signal, 0.0001)
>
> In [96]: allclose(lasso.coef_, coef, atol=1e-7)
> Out[96]: True
>
> I think you're right that the precise cost function definitely ought to be
> documented in the front-facing classes rather than just the low-level Cython
> routines.
>
> I also think that scaling the way Lasso/ElasticNet does in the context of
> sparse coding may be very confusing, since in sparse coding it corresponds
> not to a number of training samples in a regression problem but to the
> number of input dimensions.
>
> The docstring of sparse_encode is quite confusing in that X, the dictionary,
> says "n_samples, n_components". The number of samples (in the context of
> sparse coding) should have no influence over the shape of the dictionary;
> this seems to have leaked over from the Lasso documentation.
>
> The shape and mathematical definition of cov doesn't make much sense to me
> given this change, though (or to begin with, for that matter): in the case
> of a single problem, the desired covariance is X^T y, with y a column
> vector, yielding another column vector of shape (n_components, 1). So the
> shape, if you have multiple examples you're precomputing for, should end up
> being (n_components, n_samples), and given the shape of Y that would be
> achieved by X^T Y^T.
>
> David
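PS: for anyone who wants to reproduce the rescaling discussed above, here is
a minimal, self-contained sketch. It assumes the sparse-coding objective
implied by David's gradient check, ||y - Dw||_2^2 + gamma * ||w||_1, and
simply divides gamma by 2 * n_samples before handing it to Lasso as alpha.
The names gamma, w, grad, etc. are mine, not part of the scikit-learn API.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# Same toy problem as in the thread: a column-normalized random dictionary
# and a small random signal.
dictionary = rng.normal(size=(100, 500)) / 1000
dictionary /= np.sqrt((dictionary ** 2).sum(axis=0))
signal = rng.normal(size=100) / 1000

# Sparse-coding objective (no sample-size scaling):
#     ||y - D w||_2^2 + gamma * ||w||_1
gamma = 0.0001

# scikit-learn's Lasso minimizes
#     1/(2 * n_samples) * ||y - X w||_2^2 + alpha * ||w||_1
# so both objectives share the same minimizer when
#     alpha = gamma / (2 * n_samples).
n_samples = dictionary.shape[0]
lasso = Lasso(alpha=gamma / (2 * n_samples), max_iter=int(1e6),
              fit_intercept=False, tol=1e-10)
lasso.fit(dictionary, signal)

# Optimality check on the active set: at a minimum of the sparse-coding
# objective, 2 * D^T (D w - y) + gamma * sign(w) should be close to zero
# wherever w != 0.
w = lasso.coef_
grad = 2 * np.dot(dictionary.T, np.dot(dictionary, w) - signal)
active = w != 0
print(np.abs(grad[active] + gamma * np.sign(w[active])).max())

Dividing by 2 * n_samples rather than n_samples is exactly the extra factor
of 2 David noticed; with a tight tol the printed value should be small, which
is the same optimality check he ran against feature_sign_search.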
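And on the shape question in the last quoted paragraph, a NumPy-only
illustration of the two quantities one would precompute when coding several
signals against the same dictionary. The names D, Y, gram and cov are mine
(not the sparse_encode argument names), and the layout follows David's
convention: one atom per column of D, one signal per row of Y.

import numpy as np

rng = np.random.RandomState(0)
n_features, n_components, n_samples = 100, 500, 20

# D: dictionary with one atom per column; Y: one signal per row.
D = rng.normal(size=(n_features, n_components))
D /= np.sqrt((D ** 2).sum(axis=0))
Y = rng.normal(size=(n_samples, n_features))

# Quantities that can be precomputed once for all signals:
gram = np.dot(D.T, D)   # (n_components, n_components); independent of n_samples
cov = np.dot(D.T, Y.T)  # (n_components, n_samples); column i is D^T y_i

print(gram.shape)  # (500, 500)
print(cov.shape)   # (500, 20)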
