On 2011-12-29, at 3:18 PM, Bronco Zaurus <[email protected]> wrote:
> Hello,
>
> I have a beginner's question: how do you classify using non-numerical
> features, concretely strings (for example: 'audi', 'bmw',
> 'chevrolet')?
>
> One way that comes to mind is to give each value a number. Is there a
> more straightforward way of using string features in sklearn?

I'm assuming you're not doing NLP, where you're dealing with sequences
of arbitrary strings, but rather that you have a set of discrete
choices for a feature, where each choice is represented by a string.

One of the standard tricks here is to code these features as "one-hot"
vectors (Google for an explanation, but it's fairly simple: code "audi"
as a vector with a 1 in one specific position and zeros in all other
entries, using a different position for each distinct value).

Assigning a single number to each value is generally a bad idea because
it imposes an ordering and magnitude that are totally arbitrary, and
most algorithms are not invariant to a permutation of those numerical
values. Most learning algorithms will treat one-hot expansions more
equitably. For example, in a linear model, permuting the order of the
one-hot vector and then re-training will result in the same model with
the weights permuted.

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
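
The one-hot coding described in the reply can be sketched in a few lines
of plain Python. This is only an illustration under stated assumptions:
the sample data and the helper name one_hot_encode are hypothetical and
not part of scikit-learn. scikit-learn ships its own encoders for this
(for example DictVectorizer in sklearn.feature_extraction), which would
normally be preferable to hand-rolled code.

```python
# Hypothetical sample data: one categorical feature with string values.
makes = ["audi", "bmw", "chevrolet", "bmw"]


def one_hot_encode(values):
    """Illustrative helper (not a scikit-learn function).

    Returns the sorted distinct categories and, for each input value,
    a list of 0/1 entries with a single 1 at that category's position.
    """
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)   # all entries zero ...
        row[position[value]] = 1      # ... except the slot for this value
        rows.append(row)
    return categories, rows


categories, encoded = one_hot_encode(makes)
for make, row in zip(makes, encoded):
    print(make, row)
```

Each row contains exactly one 1, so no category is treated as "larger"
than another. Changing the column order only permutes the columns, which
is why a linear model re-trained on the permuted data ends up with the
same weights in permuted order.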
