2011/12/30 Olivier Grisel <[email protected]>:
> Alright, then the name of this kind of features is "categorical
> features" in machine learning jargon: the string is used as an
> identifier and the ordered sequence of letters is not exploited by the
> model. By contrast, "string features" means something very specific
> in machine learning jargon (e.g. sequence of DNA nucleotide symbols
> when dealing with genetic datasets).
>
> We probably need to extend the sklearn.feature_extraction.text package
> to make it more user friendly to work with pure categorical
> features occurrences:

I'm not sure this belongs in feature_extraction.text; it's much more
broadly applicable. If you poke around my branches on GitHub, you'll find
some preliminary work on both a one-hot transformer and an ARFF (Weka
format) reader. I think the latter would be very convenient for those
wanting mixed numerical/categorical data sets.

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
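
For readers unfamiliar with the term, here is a minimal sketch of what one-hot encoding of categorical features means. It is an illustration only, written in plain Python, and is not the code from the branches mentioned above; the helper names fit_vocabulary and one_hot and the sample data are hypothetical. Each distinct (feature, value) pair gets its own 0/1 column, so the string acts purely as an identifier and its letters are never inspected.

```python
def fit_vocabulary(samples):
    """Assign a column index to every (feature, value) pair seen in samples.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    vocabulary = {}
    for sample in samples:
        # Sort the items so column order does not depend on dict ordering.
        for name, value in sorted(sample.items()):
            vocabulary.setdefault((name, value), len(vocabulary))
    return vocabulary


def one_hot(sample, vocabulary):
    """Encode one sample as a list of 0/1 indicators, one per known pair."""
    row = [0] * len(vocabulary)
    for name, value in sample.items():
        index = vocabulary.get((name, value))
        # Pairs never seen while building the vocabulary are ignored.
        if index is not None:
            row[index] = 1
    return row


# Made-up samples: each one maps feature names to categorical string values.
samples = [
    {"city": "Paris", "color": "red"},
    {"city": "Amsterdam", "color": "red"},
]

vocabulary = fit_vocabulary(samples)
rows = [one_hot(sample, vocabulary) for sample in samples]
```

Numerical features could sit next to these indicator columns unchanged, which is why a mixed numerical/categorical format such as ARFF pairs naturally with this kind of encoding.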
