> On Nov 8, 2015, at 11:32 AM, Raphael C <drr...@gmail.com> wrote:
> 
> In terms of computational efficiency, one-hot encoding combined with
> the support for sparse feature vectors seems to work well, at least
> for me. I assume therefore
> the problem must be in terms of classification accuracy. 

One thing that comes to mind is the cost of the different solvers for the linear models. 
E.g., Newton’s method is O(n * d^2) per iteration, and even gradient descent is O(n * d), 
so every extra column that one-hot encoding adds to d makes each iteration more expensive.
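
To make the connection to the encoding explicit, here is a minimal sketch (toy data, 
nothing specific to your use case) of how the feature dimension d grows with the 
cardinality of a one-hot encoded column:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.randint(0, 1000, size=(5000, 1))  # n=5000, one raw column

X_onehot = OneHotEncoder().fit_transform(X)  # sparse output by default

print(X.shape)         # (5000, 1)
print(X_onehot.shape)  # (5000, ~1000): d now scales with the cardinality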

For decision trees, I don’t see a substantial difference in terms of 
computational complexity per split if a categorical feature, let’s say one that can take 4 
values, is split into 4 binary questions (i.e., using one-hot encoding). On 
the other hand, I think the problem is that the tree-growing algorithm does not know 
that these 4 binary questions “belong” to one original feature, which could make the 
decision tree grow much larger in depth and width; this is bad for 
computational efficiency and would more likely produce trees with higher 
variance.
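
As a toy illustration of that growth (with the caveat that the integer codes below 
happen to line up with the labels, which is the best case for the plain tree, so treat 
this as illustrative only):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
cat = rng.randint(0, 20, size=2000)  # one categorical feature, 20 values
y = (cat < 10).astype(int)           # label depends on a group of categories

# Integer codes: a single threshold split separates the two groups
# (only because the codes happen to align with the labels here).
t_int = DecisionTreeClassifier().fit(cat.reshape(-1, 1), y)

# One-hot: 20 indicator columns; each split can only peel off one
# category, so the tree has to chain ~10 splits to recover the group.
t_oh = DecisionTreeClassifier().fit(
    OneHotEncoder().fit_transform(cat.reshape(-1, 1)), y)

print(t_int.tree_.max_depth, t_int.tree_.node_count)  # small tree
print(t_oh.tree_.max_depth, t_oh.tree_.node_count)    # much deeper/wider tree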

I’d be curious how to handle categorical feature columns implementation-wise, 
though. I think additional parameters in the method call would be necessary, 
e.g., .fit(X, y, categorical=(1, 4, 19), nominal=(1, 4)), to distinguish ordinal from 
nominal variables. 
Or, alternatively, would this be a good use case for numpy’s structured 
arrays?
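
To make that concrete, a purely hypothetical sketch (neither keyword argument exists 
in scikit-learn; clf stands for any estimator, and the column indices are made up):

# Hypothetical API, not real scikit-learn: the estimator would be told
# which columns are categorical, and which of those are unordered.
clf.fit(X, y,
        categorical=(1, 4, 19),  # columns to treat as categorical
        nominal=(1, 4))          # the subset that is unordered

And the structured-array alternative, where per-column dtypes could let the estimator 
infer the categorical columns on its own:

import numpy as np

X = np.array([(25, 'red', 3.2), (31, 'blue', 1.7)],
             dtype=[('age', 'i4'), ('color', 'U8'), ('score', 'f4')])
print(X.dtype.names)  # ('age', 'color', 'score')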


