John, you're right about the difference in nomenclature. I've been using
scikit-learn's names for the parameters, so the alpha I've referred to
is the regularization strength and corresponds to lambda in glmnet. The
mixing parameter, referred to in glmnet as alpha, is the L1-ratio in
scikit-learn.
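To make the mapping concrete, here is a toy sketch (the numbers are
arbitrary, and I'm ignoring glmnet's default standardization of the
inputs):

    from sklearn.linear_model import ElasticNet

    # scikit-learn's alpha is glmnet's lambda (overall penalty strength),
    # and scikit-learn's l1_ratio is glmnet's alpha (the L1/L2 mixing
    # parameter). So glmnet(x, y, lambda = 0.5, alpha = 0.7) corresponds to:
    model = ElasticNet(alpha=0.5, l1_ratio=0.7)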
Nick, thank you very much for the tip on how the L1 norm of an OLS
solution is used to determine the maximum regularization strength for
lasso. Thinking about how that would extend to elastic net: with an
L1-ratio of 1, alpha_max is the L1 norm of an OLS solution, because
elastic net reduces to lasso in this case. But with L1-ratios between
zero and one, couldn't alpha_max be greater than the L1 norm of an OLS
solution since alpha_max for the elastic net is not the L1
regularization strength, but rather the overall regularization strength,
distributed between L1 and L2? As the ElasticNet documentation says,
alpha = L1 strength + L2 strength, and l1_ratio = L1 strength /
(L1 strength + L2 strength). It seems like the alpha_max for elastic net
with a given L1-ratio could be some function of both the L1 and L2 norms
of an OLS solution, and it might be a simple combination. But I haven't
found it browsing the literature, and I am unsure of how to derive it.
I did find the part in coordinate_descent.py where alpha_max is chosen,
but I don't fully understand the reasoning behind it:
    alpha_max = np.abs(Xy).max() / (n_samples * l1_ratio)
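My tentative reading of that line: at w = 0 the L2 part of the penalty
has zero gradient, so the subgradient (KKT) condition for w = 0 to be
optimal is |x_j' y| / n_samples <= alpha * l1_ratio for every feature j,
and the smallest alpha satisfying that for all j is exactly the formula
above. A quick sanity check on toy data (my own sketch, with
fit_intercept=False so centering doesn't get in the way):

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.RandomState(0)
    X, y = rng.randn(50, 10), rng.randn(50)
    l1_ratio = 0.5

    # Same formula as in coordinate_descent.py.
    alpha_max = np.abs(X.T.dot(y)).max() / (X.shape[0] * l1_ratio)

    # At alpha_max every coefficient should be exactly zero...
    enet = ElasticNet(alpha=alpha_max, l1_ratio=l1_ratio,
                      fit_intercept=False).fit(X, y)
    print(enet.coef_)  # all zeros

    # ...and just below it, at least one coefficient comes alive.
    enet = ElasticNet(alpha=0.9 * alpha_max, l1_ratio=l1_ratio,
                      fit_intercept=False).fit(X, y)
    print(np.count_nonzero(enet.coef_))  # >= 1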
Another concern: if the data does not have mean zero and/or unit
variance (I've been told this might be ok if, for example, I want to
preserve sparsity in the input), might this affect the magnitude of the
solution coefficients and hence the calculation of alpha_max?
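Here is what I mean, numerically (toy data with very different feature
scales; StandardScaler is just one possible preprocessing):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X = rng.randn(100, 5) * np.array([0.1, 1.0, 10.0, 100.0, 1000.0])
    y = rng.randn(100)
    l1_ratio = 0.5

    def alpha_max(X, y, l1_ratio):
        return np.abs(X.T.dot(y)).max() / (X.shape[0] * l1_ratio)

    print(alpha_max(X, y, l1_ratio))  # dominated by the largest-scale column
    print(alpha_max(StandardScaler().fit_transform(X), y, l1_ratio))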
And I'm still not sure how to pick the smallest value of alpha (or
rather "eps," the ratio of the smallest alpha to the largest).
Now for the L1-ratio. The ElasticNetCV class does not automatically
choose a set of L1-ratios to test, as it does with the alphas; it's up
to the user to supply them. However, it does mention in the
documentation for ElasticNetCV:
    Note that a good choice of list of values for l1_ratio is often
    to put more values close to 1 (i.e. Lasso) and less close to 0
    (i.e. Ridge), as in [.1, .5, .7, .9, .95, .99, 1]
I understand John's reasoning that good L1-ratios are likely to be
higher the greater the proportion of variables to samples. If anyone
knows of other considerations that could go into choosing an appropriate
set of L1-ratios, let me know.
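For completeness, supplying such a list looks like this (toy data, and
cv=5 is an arbitrary choice):

    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 20), rng.randn(100)

    # ElasticNetCV cross-validates over every (l1_ratio, alpha) pair,
    # building an alpha path per l1_ratio automatically.
    enet = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=5)
    enet.fit(X, y)
    print(enet.l1_ratio_, enet.alpha_)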
Lastly: I was excited about the idea of trying first with a sparse grid
and then repeating the search in more detail in the area of parameter
values yielding high cross-validation scores. However, I notice in the
paper associated with Nick's link that it says "In practice, an upper
bound must be selected for any grid-search optimization [over values of
the L1 regularization parameter]. Note that more advanced optimization
techniques are generally not practical as the CV objective function
[...] is often noisy." Any thoughts on this?
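In case it helps the discussion, the two-stage search I had in mind
would look roughly like this (a sketch only: the factor-of-10 refinement
window and grid sizes are arbitrary, and given the noisy CV objective
the second stage may not buy much):

    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 20), rng.randn(100)

    # Stage 1: sparse, wide grid.
    coarse = ElasticNetCV(l1_ratio=[.1, .5, .9, 1], n_alphas=20, cv=5)
    coarse.fit(X, y)

    # Stage 2: denser grid around the stage-1 winner.
    alphas = np.logspace(np.log10(coarse.alpha_ / 10),
                         np.log10(coarse.alpha_ * 10), num=50)
    l1_ratios = np.clip(coarse.l1_ratio_ + np.linspace(-0.1, 0.1, 5),
                        0.01, 1.0)
    fine = ElasticNetCV(l1_ratio=list(l1_ratios), alphas=alphas, cv=5)
    fine.fit(X, y)
    print(fine.l1_ratio_, fine.alpha_)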