There is actually work on embedding word sense into a vector space; see, for
example, "Word representations: A simple and general method for
semi-supervised learning".
On Fri, Dec 30, 2011 at 6:26 AM, Robert Layton <[email protected]> wrote:
> On 30 December 2011 08:57, Gael Varoquaux
> <[email protected]> wrote:
>
>> On Thu, Dec 29, 2011 at 09:18:38PM +0100, Bronco Zaurus wrote:
>> > I have a beginner's question: how do you classify using non-numerical
>> > features, concretely strings (for example: 'audi', 'bmw',
>> > 'chevrolet')?
>>
>> You are in trouble as your input space is not metric: what's .5*('audi' +
>> 'chevrolet')? Standard continuous mathematical formulations do not apply.
>>
>> I do believe that there are algorithms to deal with this kind of problem,
>> but the scikit does not implement any, and this is quite far from my area
>> of expertise. My approach would be to look for other kinds of features.
>>
>> Sorry for the bad news.
>>
>> Gaël
>>
>>
>> ------------------------------------------------------------------------------
>> Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
>> infrastructure or vast IT resources to deliver seamless, secure access to
>> virtual desktops. With this all-in-one solution, easily deploy virtual
>> desktops for less than the cost of PCs and save 60% on VDI infrastructure
>> costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
> Answering the more general question, you are looking for things such as
> the Levenshtein distance <http://en.wikipedia.org/wiki/Levenshtein_distance>
> and its related methods (see the "See also" section of that Wikipedia page).
> Some of these methods actually are metrics in the true sense, which will
> make some things easier, but they don't "embed" into a vector space.
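
To make the Levenshtein distance mentioned above concrete, here is a minimal
pure-Python sketch. The function name `levenshtein` and the `makes` list are
my own illustration, not part of any library; it implements the textbook
dynamic-programming recurrence with unit costs and is not tuned for speed.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    # previous[j] = distance between the first i-1 chars of a and first j of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i chars of a into "" costs i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


# Hypothetical example data: pairwise distances between car makes.
makes = ["audi", "bmw", "chevrolet"]
distances = [[levenshtein(x, y) for y in makes] for x in makes]
```

A matrix like `distances` gives pairwise dissimilarities only; it could feed a
method that accepts precomputed distances, but it does not give each string
coordinates in a vector space.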
>
> For scikits.learn, I don't believe there is much beyond n-grams, which
> are in the feature extraction module (it turns strings into vectors of
> numbers). See
> here <http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text>
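
As a rough sketch of the n-gram route described above, assuming a recent
scikit-learn release in which `CountVectorizer` accepts `analyzer="char"` and
`ngram_range` (the `makes` list and the choice of 2- and 3-grams are my own
assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example data: the car makes from the original question.
makes = ["audi", "bmw", "chevrolet"]

# Count overlapping character 2-grams and 3-grams inside each string.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X = vectorizer.fit_transform(makes)  # sparse matrix, one row per string

# Each column corresponds to one distinct n-gram seen during fitting.
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row of `X` is then an ordinary numeric feature vector that can be passed
to a classifier. Whether character n-grams are informative for short category
labels like these is a separate question.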
>
> Hope that helps.
> --
>
> Public key at: http://pgp.mit.edu/ Search for this email address and
> select the key from "2011-08-19" (key id: 54BA8735)
>
--
Best Wishes
--------------------------------------------
Meng Xinfan(蒙新泛)
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China