Re: [Scikit-learn-general] using string features for classification

David Warde-Farley Thu, 29 Dec 2011 21:45:43 -0800

On 2011-12-29, at 3:18 PM, Bronco Zaurus <[email protected]> wrote:


> Hello, 
> 
> I have a beginner's question: how do you classify using non-numerical 
> features, concretely strings (for example: 'audi', 'bmw', 
> 'chevrolet')? 
> 
> One way that comes to mind is to give each value a number. Is there a 
>  more straightforward way of using string features in sklearn?

I'm assuming you're not doing NLP, where you'd be dealing with sequences of 
arbitrary strings, but rather that you have a set of discrete choices for a 
feature, each choice represented by a string. One of the standard tricks here 
is to code these features as "one hot" vectors (Google for a fuller 
explanation, but it's fairly simple: code "audi" as a vector with a 1 in one 
specific position and zeros everywhere else, using a different position for 
each distinct value).
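
To make the one-hot idea concrete, here is a minimal sketch in plain Python. 
The helper name one_hot_encode and the sorted column order are my own choices 
for illustration, not anything from sklearn; it only shows the shape of the 
encoding.

```python
def one_hot_encode(values):
    """Map each distinct string to a one-hot list of 0/1 ints.

    Hypothetical helper for illustration only (not an sklearn function).
    """
    # Sort the distinct values so the column order is deterministic.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}

    encoded = []
    for v in values:
        row = [0] * len(categories)  # all zeros ...
        row[index[v]] = 1            # ... except the slot for this value
        encoded.append(row)
    return categories, encoded


makes = ["audi", "bmw", "chevrolet", "bmw"]
categories, X = one_hot_encode(makes)

print(categories)
for make, row in zip(makes, X):
    print(make, row)
```

Each row of X can then be concatenated with whatever numerical features you 
already have before handing the data to a classifier.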

Assigning a number is generally a bad idea because it imposes an 
ordering/magnitude that is totally arbitrary, and most algorithms are not 
invariant to a permutation of those numerical values. Most learning algorithms 
will treat one-hot expansions more equitably, e.g. in a linear model, 
permuting the order of the one-hot vector and then re-training will result in 
the same model with the weights permuted.
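
A small sketch of that difference, in plain Python with made-up numbers. The 
data, the helper names, and the use of least squares without an intercept for 
the one-hot case are all assumptions for illustration. With one-hot columns 
and no intercept, the least-squares weight for each column is just the mean 
target of that category, so reordering the columns only reorders the weights. 
With integer codes, the fitted line changes depending on which number each 
string happens to get.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_one_hot(cats, ys, order):
    """Least squares on one-hot columns, no intercept.

    Each row has exactly one 1, so the columns are orthogonal and the
    solution for each column is the mean target of that category.
    """
    weights = []
    for c in order:
        targets = [y for cat, y in zip(cats, ys) if cat == c]
        weights.append(sum(targets) / len(targets))
    return weights


# Made-up data, purely illustrative.
makes = ["audi", "bmw", "chevrolet", "audi", "bmw", "chevrolet"]
prices = [30.0, 50.0, 20.0, 32.0, 48.0, 22.0]


def predict_with_coding(coding):
    """Fit a line on integer codes and return the prediction per make."""
    slope, intercept = fit_line([coding[m] for m in makes], prices)
    return {m: slope * code + intercept for m, code in coding.items()}


# Two equally arbitrary integer assignments give different models.
pred_a = predict_with_coding({"audi": 0, "bmw": 1, "chevrolet": 2})
pred_b = predict_with_coding({"bmw": 0, "audi": 1, "chevrolet": 2})
print("integer coding A:", pred_a)
print("integer coding B:", pred_b)

# Two column orders for the one-hot encoding give the same model,
# with the weights permuted to match.
order_1 = ["audi", "bmw", "chevrolet"]
order_2 = ["chevrolet", "audi", "bmw"]
w1 = fit_one_hot(makes, prices, order_1)
w2 = fit_one_hot(makes, prices, order_2)
print("one-hot order 1:", dict(zip(order_1, w1)))
print("one-hot order 2:", dict(zip(order_2, w2)))
```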
