On 30 December 2011 08:57, Gael Varoquaux <[email protected]> wrote:

> On Thu, Dec 29, 2011 at 09:18:38PM +0100, Bronco Zaurus wrote:
> >    I have a beginner's question: how do you classify using non-numerical
> >    features, concretely strings (for example: 'audi', 'bmw',
> >    'chevrolet')?
>
> You are in trouble as your input space is not metric: what's .5*('audi' +
> 'chevrolet')? Standard continuous mathematical formulations do not apply.
>
> I do believe that there are algorithms to deal with this kind of problem,
> but the scikit does not implement any, and this is quite far from my area
> of expertise. My approach would be to look for other kinds of features.
>
> Sorry for the bad news.
>
> Gaël
>
>
> ------------------------------------------------------------------------------
> Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
> infrastructure or vast IT resources to deliver seamless, secure access to
> virtual desktops. With this all-in-one solution, easily deploy virtual
> desktops for less than the cost of PCs and save 60% on VDI infrastructure
> costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>


Answering the more general question, you are looking for things such as
the Levenshtein distance <http://en.wikipedia.org/wiki/Levenshtein_distance>
and its related methods (see the "See also" section of that Wikipedia page).
Some of these methods are metrics in the true sense, which will make some
things easier, but they don't "embed" into a vector space.
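
To make the idea concrete, here is a minimal sketch of the Levenshtein
distance in plain Python, applied to the example strings from the original
question. The function name `levenshtein` and the `brands` list are my own
illustration, not anything that scikit-learn provides. The result is a
pairwise distance matrix, not a feature vector for each string.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Hypothetical example data, taken from the question above.
brands = ['audi', 'bmw', 'chevrolet']
distances = [[levenshtein(x, y) for y in brands] for x in brands]
```

A matrix like `distances` suits methods that only need pairwise distances
(nearest neighbours, for instance), which is the sense in which having a true
metric makes some things easier.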

For scikits.learn, I don't believe there is much outside of n-grams, which
live in the text feature extraction module (it turns strings into vectors
of numbers). See here:
<http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text>
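
As a rough illustration of what n-gram feature extraction does, here is a
small pure-Python sketch that counts character n-grams and lays the counts
out as one numeric row per string. The helper names `char_ngrams` and
`vectorize`, and the choice of 1- and 2-grams, are assumptions of mine for the
example. This is not the scikit-learn implementation, only the same idea in
miniature.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=2):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


def vectorize(strings, n_min=1, n_max=2):
    """Return (vocabulary, matrix) where each row holds n-gram counts."""
    per_string = [char_ngrams(s, n_min, n_max) for s in strings]
    # The vocabulary is every n-gram seen in any string, in sorted order.
    vocabulary = sorted(set().union(*per_string))
    # A Counter returns 0 for n-grams a string does not contain.
    matrix = [[counts[gram] for gram in vocabulary] for counts in per_string]
    return vocabulary, matrix


# Hypothetical example data, taken from the question above.
brands = ['audi', 'bmw', 'chevrolet']
vocabulary, X = vectorize(brands)
```

Each row of `X` is a fixed-length vector of numbers, so it can be passed to
any estimator that expects numeric input. With only three short strings the
rows are mostly zeros, which is typical of this representation.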

Hope that helps.
-- 

Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)
