> On Nov 8, 2015, at 11:32 AM, Raphael C <drr...@gmail.com> wrote:
>
> In terms of computational efficiency, one-hot encoding combined with
> the support for sparse feature vectors seems to work well, at least
> for me. I assume therefore the problem must be in terms of
> classification accuracy.
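Just so we are on the same page, I take this to mean the sparse
one-hot path, roughly like this (a minimal sketch; the integer-coded
toy column is made up):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X = np.array([[0], [1], [2], [3], [1]])  # one column, 4 levels
    enc = OneHotEncoder()                    # sparse output by default
    X_onehot = enc.fit_transform(X)          # scipy sparse matrix
    print(X_onehot.shape)                    # (5, 4): d grows by the cardinality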
One thing comes to mind regarding the different solvers for the linear
models: e.g., Newton's method is O(n * d^2), and even gradient descent
is O(n * d), so one-hot encoding, which inflates d by the cardinality
of each categorical feature, makes these solvers more expensive.

For decision trees, I don't see a substantial difference in terms of
computational complexity if a categorical feature, let's say one that
can take 4 values, is split into 4 binary questions (i.e., using
one-hot encoding). On the other hand, I think the problem is that the
decision tree algorithm does not know that these 4 binary questions
"belong" to one original feature, which could make the decision tree
grow much larger in depth and width; this is bad for computational
efficiency and would more likely produce trees with higher variance
(see the toy example below).

I'd be curious how to handle categorical feature columns
implementation-wise, though. I think additional parameters in the
method call would be necessary, e.g., .fit(categorical=(1, 4, 19),
nominal=(1, 4)), to distinguish ordinal from nominal variables. Or,
alternatively, would this be a good use case for numpy's structured
arrays (second sketch below)?
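To make the tree point concrete, here is a minimal sketch (the toy
data and target are made up): if the target depends on a grouping of
the levels, the tree needs one binary split per level involved,
whereas a single native categorical split ({1, 3} vs. {0, 2}) would
suffice.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    cat = rng.randint(0, 4, size=200)          # a 4-valued categorical feature
    X = np.eye(4)[cat]                         # one-hot: 4 binary columns
    y = ((cat == 1) | (cat == 3)).astype(int)  # target groups two of the levels

    tree = DecisionTreeClassifier().fit(X, y)
    print(tree.tree_.max_depth)                # 2: one split per level involved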
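And for the structured array idea, the per-column dtypes could carry
the type information instead of extra fit() parameters; a rough sketch
(the field names and dtypes here are made up):

    import numpy as np

    # 'size' would be ordinal, 'color' nominal, 'price' plain numeric
    dt = np.dtype([('size', np.int32), ('color', 'S10'),
                   ('price', np.float64)])
    X = np.array([(1, b'red', 9.99), (3, b'blue', 4.50), (2, b'red', 7.25)],
                 dtype=dt)
    print(X['color'])  # columns stay addressable by name with their own dtype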