On 5 November 2015 at 13:38, Gael Varoquaux
<gael.varoqu...@normalesup.org> wrote:
> On Thu, Nov 05, 2015 at 07:05:11AM +0000, Raphael C wrote:
>> https://github.com/szilard/benchm-ml
>
>> The upshot is that in some cases it seems that the scikit-learn
>> versions have room for improvement.
>
> The main lessons that I can see from those results are:
>
> * Linear models (aka LogisticRegression) don't scale very well:
>
>   - The page benches the default, which is liblinear.
>     I would be very curious to see how the other solvers (Newton and
>     SAG) fare on this dataset.
>     It would be useful to introduce a 'solver="auto"' for logistic
>     regression, based on heavy benchmarks and heuristics.
>     I have created an issue about this, to discuss if we want to do this:
>     https://github.com/scikit-learn/scikit-learn/issues/5736
>
>   - Having fused types to avoid increased memory would be useful.
>     For this we first need to finish adding cython as a build dependency:
>     https://github.com/scikit-learn/scikit-learn/pull/5492
>
> * In tree-based models, not handling categorical variables as such hurts
>   us a lot. There's a PR to fix that; it still needs a bit of love:
>   https://github.com/scikit-learn/scikit-learn/pull/4899
>

Thank you for this very helpful reply. One perhaps naive question:
why does not handling categorical variables as such hurt so much?

In terms of computational efficiency, one-hot encoding combined with
the support for sparse feature vectors seems to work well, at least
for me. I therefore assume the problem must be in terms of
classification accuracy. Is that right, and if so, why?
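
For concreteness, here is a minimal sketch of the kind of setup I mean.
The toy data and variable names are made up for illustration and are not
taken from the benchmark, and it assumes a scikit-learn version whose
OneHotEncoder accepts string categories directly:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: a single categorical column with four levels.
rng = np.random.RandomState(0)
levels = np.array(["AA", "DL", "UA", "WN"])
X_cat = rng.choice(levels, size=(200, 1))

# The label depends on whether the category falls in a given subset.
y = np.isin(X_cat[:, 0], ["AA", "WN"]).astype(int)

# One-hot encode; the encoder returns a SciPy sparse matrix by default,
# with one non-zero entry per row for this single-column input.
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(X_cat)

# Tree-based estimators accept the sparse matrix directly.
clf = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)

print("sparse input:", sparse.issparse(X_onehot), "shape:", X_onehot.shape)
print("training accuracy:", clf.score(X_onehot, y))
print("tree depth:", clf.get_depth())
```

This only shows that the encoding stays sparse and that the tree trains
on it; it says nothing about accuracy on real data, which is what I am
asking about.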

Raphael

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
