On 5 November 2015 at 13:38, Gael Varoquaux <gael.varoqu...@normalesup.org> wrote:
> On Thu, Nov 05, 2015 at 07:05:11AM +0000, Raphael C wrote:
>> https://github.com/szilard/benchm-ml
>
>> The upshot is that in some cases it seems that the scikit-learn
>> versions have room for improvement.
>
> The various main lessons that I can see from those results are:
>
> * Linear models (aka LogisticRegression) don't scale very well:
>
>   - The page benches the default, which is liblinear.
>     I would be very curious to see how the other solvers (Newton, and
>     SAG) fare on this dataset.
>     It would be useful to introduce a 'solver="auto"' for logistic
>     regression, based on heavy benchmarks and heuristics.
>     I have created an issue about this, to discuss if we want to do this:
>     https://github.com/scikit-learn/scikit-learn/issues/5736
>
>   - Having fused types to avoid increased memory would be useful.
>     For this we first need to finish adding cython as a build dependency:
>     https://github.com/scikit-learn/scikit-learn/pull/5492
>
>   - In tree-based models, not handling categorical variables as such
>     hurts us a lot.
>     There's a PR to fix that, it still needs a bit of love:
>     https://github.com/scikit-learn/scikit-learn/pull/4899
>
Thank you for this very helpful reply.

One perhaps naive question: why does not handling categorical variables
hurt a lot? In terms of computational efficiency, one-hot encoding
combined with the support for sparse feature vectors seems to work well,
at least for me. I assume, therefore, that the problem must be in terms
of classification accuracy. Is that right and, if so, why?

Raphael

------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
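
To make the sparse one-hot encoding mentioned in the question concrete, here is a minimal, library-free sketch. It is only an illustration under stated assumptions: the function name `one_hot_sparse` and the toy `carriers` column are hypothetical, and in practice one would more likely use scikit-learn's `OneHotEncoder`, which returns a sparse matrix by default. The point is simply that each row stores one non-zero entry regardless of how many categories there are.

```python
# Library-free sketch (hypothetical helper, not scikit-learn's API):
# one-hot encode a categorical column, storing only the non-zero entries.

def one_hot_sparse(values):
    """Return (categories, coords).

    categories: sorted list of distinct values, one output column each.
    coords: list of (row, column) positions whose one-hot value is 1.
    """
    categories = sorted(set(values))
    column_of = {cat: j for j, cat in enumerate(categories)}
    coords = [(i, column_of[v]) for i, v in enumerate(values)]
    return categories, coords


# Hypothetical toy column, invented for illustration only.
carriers = ["AA", "UA", "DL", "AA", "WN", "UA"]
categories, coords = one_hot_sparse(carriers)

n_rows, n_cols = len(carriers), len(categories)
dense_cells = n_rows * n_cols   # cells a dense one-hot matrix would hold
stored = len(coords)            # entries the sparse form actually stores

print(categories)
print(coords)
print(dense_cells, stored)
```

The sparse form grows with the number of rows only, while the dense form grows with rows times categories, which is why the encoding stays cheap in memory even for high-cardinality variables. This says nothing about accuracy, which is the open part of the question.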