2011/12/30 Olivier Grisel <[email protected]>:
> Alright, then the name for this kind of feature is "categorical
> feature" in machine learning jargon: the string is used as an
> identifier and the ordered sequence of letters is not exploited by the
> model. By contrast, "string features" means something very specific
> in machine learning jargon (e.g. sequences of DNA nucleotide symbols
> when dealing with genetic datasets).
>
> We probably need to extend the sklearn.feature_extraction.text package
> to make it more user-friendly to work with pure categorical
> feature occurrences:

I'm not sure this belongs in feature_extraction.text; it's much more
broadly applicable.

If you poke around my branches on GitHub, you'll find some preliminary
work on both a one-hot transformer and an ARFF (Weka format) reader. I
think the latter would be very convenient for those wanting mixed
numerical/categorical data sets.
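
To make "one-hot" concrete: each distinct (feature, value) pair gets its
own 0/1 column, so the string acts purely as an identifier. The following
is only a rough sketch in plain Python, not the code from the branches
mentioned above; the function name one_hot_encode and the sample rows are
made up for illustration.

```python
def one_hot_encode(rows):
    """Encode a list of dicts of categorical values as 0/1 vectors.

    Hypothetical helper for illustration only; not a scikit-learn API.
    """
    # Collect every (feature, value) pair seen in the data; sorting gives
    # a deterministic column order.
    vocab = sorted({(name, value) for row in rows for name, value in row.items()})
    index = {pair: i for i, pair in enumerate(vocab)}

    encoded = []
    for row in rows:
        vec = [0] * len(vocab)
        for name, value in row.items():
            # Set the column that corresponds to this exact pair.
            vec[index[(name, value)]] = 1
        encoded.append(vec)
    return vocab, encoded


# Made-up sample data: two categorical features per row.
rows = [
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "square"},
    {"color": "red", "shape": "circle"},
]
vocab, encoded = one_hot_encode(rows)
```

Note that the letters inside "red" or "blue" never matter here, only
whether two values are equal, which is what distinguishes categorical
features from string features in the sense quoted above.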

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
