John, you're right about the difference in nomenclature. I've been using scikit-learn's names for the parameters, so the alpha I've referred to is the regularization strength and corresponds to lambda in glmnet. The mixing parameter, referred to in glmnet as alpha, is the L1-ratio in scikit-learn.
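
In case it helps anyone else keep the two straight, the mapping as I understand it (just the naming; glmnet's default standardization and objective scaling are a separate question):

   from sklearn.linear_model import ElasticNet

   # scikit-learn's alpha    plays the role of glmnet's lambda (overall strength)
   # scikit-learn's l1_ratio plays the role of glmnet's alpha  (L1/L2 mixing)
   model = ElasticNet(alpha=0.1, l1_ratio=0.5)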

Nick, thank you very much for the tip on how the L1 norm of an OLS solution is used to determine the maximum regularization strength for lasso. Thinking about how that would extend to elastic net: with an L1-ratio of 1, alpha_max is the L1 norm of an OLS solution, because elastic net reduces to lasso in that case. But with L1-ratios between zero and one, couldn't alpha_max be greater than the L1 norm of an OLS solution, since alpha_max for the elastic net is not the L1 regularization strength but the overall regularization strength, distributed between L1 and L2? As the ElasticNet documentation says, alpha = L1 strength + L2 strength, and l1_ratio = L1 strength / (L1 strength + L2 strength). It seems like the alpha_max for elastic net with a given L1-ratio could be some function of both the L1 and L2 norms of an OLS solution, perhaps even a simple combination, but I haven't found it in the literature and am unsure how to derive it.
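
To spell out that decomposition in code (the values are arbitrary, just to make the relationship concrete):

   alpha, l1_ratio = 0.5, 0.7                # arbitrary example values
   l1_strength = alpha * l1_ratio            # the "a" in the docstring
   l2_strength = alpha * (1 - l1_ratio)      # the "b" in the docstring
   assert abs(alpha - (l1_strength + l2_strength)) < 1e-12
   assert abs(l1_ratio - l1_strength / (l1_strength + l2_strength)) < 1e-12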

I did find the part in coordinate_descent.py where alpha_max is chosen, but I don't fully understand the reasoning behind it:

   alpha_max = np.abs(Xy).max() / (n_samples * l1_ratio)
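
If I'm reading it right (this is my own reasoning, not anything stated in the code), Xy there is the precomputed np.dot(X.T, y), and the formula gives the smallest alpha at which the all-zero coefficient vector satisfies the optimality conditions: only the L1 term can hold coefficients at exactly zero, which is why the lasso threshold max|X^T y| / n_samples gets divided by l1_ratio. A quick numerical check (untested sketch, assuming centered data):

   import numpy as np
   from sklearn.linear_model import ElasticNet

   rng = np.random.RandomState(0)
   X = rng.randn(50, 10)
   y = rng.randn(50)
   X -= X.mean(axis=0)                  # center, as the formula assumes
   y -= y.mean()

   l1_ratio = 0.5
   alpha_max = np.abs(np.dot(X.T, y)).max() / (X.shape[0] * l1_ratio)

   # At or above alpha_max, every coefficient should be exactly zero...
   enet = ElasticNet(alpha=1.001 * alpha_max, l1_ratio=l1_ratio).fit(X, y)
   print(np.all(enet.coef_ == 0))       # expect True

   # ...while just below it, at least one coefficient becomes nonzero.
   enet = ElasticNet(alpha=0.99 * alpha_max, l1_ratio=l1_ratio).fit(X, y)
   print(np.any(enet.coef_ != 0))       # expect True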


Another concern: if the data does not have mean zero and/or unit variance (I've been told this might be ok if, for example, I want to preserve sparsity in the input), might this affect the magnitude of the solution coefficients and hence the calculation of alpha_max?
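
To make the concern concrete, this is the kind of effect I mean (toy sketch of my own):

   import numpy as np

   rng = np.random.RandomState(0)
   X = rng.randn(50, 3)
   y = rng.randn(50)
   l1_ratio = 0.5

   alpha_max = np.abs(np.dot(X.T, y)).max() / (X.shape[0] * l1_ratio)

   # Rescaling a single feature rescales its entry in X^T y, which can
   # change both which feature attains the max and the value of the max:
   X[:, 0] *= 100.0
   alpha_max_rescaled = np.abs(np.dot(X.T, y)).max() / (X.shape[0] * l1_ratio)
   print(alpha_max, alpha_max_rescaled)   # generally differ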

And I'm still not sure how to pick the smallest value of alpha (or rather "eps," the ratio of the smallest to the largest value).
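
From what I can tell in coordinate_descent.py, once alpha_max is known the grid is just log-spaced down to eps * alpha_max (the defaults below are the ones in the ElasticNetCV docstring):

   import numpy as np

   alpha_max = 1.0                       # placeholder; computed as above
   eps, n_alphas = 1e-3, 100             # docstring defaults
   alphas = np.logspace(np.log10(alpha_max * eps), np.log10(alpha_max),
                        num=n_alphas)[::-1]    # descending from alpha_max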

Now for the L1-ratio. The ElasticNetCV class does not automatically choose a set of L1-ratios to test as it does with the alphas; it's up to the user to supply them. The ElasticNetCV documentation does, however, offer this hint:

   Note that a good choice of list of values for l1_ratio is often to
   put more values close to 1 (i.e. Lasso) and less close to 0 (i.e.
   Ridge), as in [.1, .5, .7, .9, .95, .99, 1]

I understand John's reasoning that good L1-ratios are likely to be higher the greater the proportion of variables to samples. If anyone knows of other considerations that could go into choosing an appropriate set of L1-ratios, let me know.
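
For reference, here is roughly what I have in mind with that suggested list (sketch with synthetic data):

   import numpy as np
   from sklearn.linear_model import ElasticNetCV

   rng = np.random.RandomState(0)
   X = rng.randn(100, 20)
   y = np.dot(X[:, :3], [1.0, 2.0, -1.5]) + 0.1 * rng.randn(100)

   # Grid weighted toward the lasso end, per the docstring's suggestion:
   enet_cv = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=5)
   enet_cv.fit(X, y)
   print(enet_cv.l1_ratio_, enet_cv.alpha_)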

Lastly: I was excited about the idea of trying first with a sparse grid and then repeating the search in more detail in the area of parameter values yielding high cross-validation scores. However, I notice in the paper associated with Nick's link that it says "In practice, an upper bound must be selected for any grid-search optimization [over values of the L1 regularization parameter]. Note that more advanced optimization techniques are generally not practical as the CV objective function [...] is often noisy." Any thoughts on this?
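
What I was picturing was roughly this two-stage scheme (a sketch; the refinement window is my own arbitrary choice), though per the quote above the noise in the CV score may make the coarse-grid winner unstable:

   import numpy as np
   from sklearn.linear_model import ElasticNetCV

   rng = np.random.RandomState(0)
   X = rng.randn(100, 20)
   y = np.dot(X[:, :3], [1.0, 2.0, -1.5]) + 0.1 * rng.randn(100)

   # Stage 1: coarse grid over both parameters.
   coarse = ElasticNetCV(l1_ratio=[.1, .5, .9, 1.], n_alphas=20, cv=5).fit(X, y)

   # Stage 2: refine alpha within a decade either side of the coarse winner,
   # keeping the selected l1_ratio fixed.
   fine_alphas = np.logspace(np.log10(coarse.alpha_) - 1,
                             np.log10(coarse.alpha_) + 1, num=50)
   fine = ElasticNetCV(l1_ratio=coarse.l1_ratio_, alphas=fine_alphas,
                       cv=5).fit(X, y)
   print(fine.alpha_, fine.l1_ratio_)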
