2011/12/30 Bronco Zaurus <[email protected]>:
> Thank you for all the answers. Yes, I'm not dealing with arbitrary strings,
> just a set of possible values, so the binary representation seems OK.
Alright, then in machine learning jargon these are called "categorical
features": the string is used only as an identifier, and the model does
not exploit the ordered sequence of letters. By contrast, "string
features" means something very specific in machine learning jargon
(e.g. sequences of DNA nucleotide symbols when dealing with genetic
datasets).
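To make the "binary representation" of a categorical feature concrete, here is a minimal pure-Python sketch. The helper name `one_hot` and the colour values are made up for illustration; this is not an existing scikit-learn function.

```python
def one_hot(values):
    """Encode a column of categorical strings as binary indicator rows."""
    # Sorting the distinct values gives each category a stable column index.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        # The string is only an identifier: its letters are never inspected.
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["red", "green", "red", "blue"])
print(categories)  # ['blue', 'green', 'red']
print(rows)        # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```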
We probably need to extend the sklearn.feature_extraction.text package
to make it more user-friendly for working with occurrences of purely
categorical features:
>>> data = [
...     {'feature_1', 'feature_3'},
...     {'feature_2'},
... ]
>>> vec = CategoricalVectorizer().fit(data)
>>> vec.vocabulary
{'feature_1': 0, 'feature_2': 1, 'feature_3': 2}
>>> vec.transform(data).toarray()
array([[1., 0., 1.],
       [0., 1., 0.]])
or with numerically valued features whose names are dictionary keys:
>>> data = [
...     {'feature_1': 0.1, 'feature_3': 42.},
...     {'feature_2': 0.4},
... ]
>>> vec = NamedNumericalVectorizer().fit(data)
>>> vec.vocabulary
{'feature_1': 0, 'feature_2': 1, 'feature_3': 2}
>>> vec.transform(data).toarray()
array([[ 0.1,  0. , 42. ],
       [ 0. ,  0.4,  0. ]])
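As a rough idea of how little code the two cases above need, here is a minimal pure-Python sketch. The class name `NamedFeatureVectorizer` is hypothetical, it returns dense lists of floats instead of a sparse matrix, and ignoring feature names unseen during fit is an assumption of mine, not a settled design.

```python
class NamedFeatureVectorizer:
    """Hypothetical sketch: turn sets or dicts of named features into rows."""

    def fit(self, samples):
        names = set()
        for sample in samples:
            # Iterating over a set or a dict yields the feature names.
            names.update(sample)
        # Sorted names give a deterministic name -> column index mapping.
        self.vocabulary = {name: i for i, name in enumerate(sorted(names))}
        return self

    def transform(self, samples):
        rows = []
        for sample in samples:
            row = [0.0] * len(self.vocabulary)
            for name in sample:
                column = self.vocabulary.get(name)
                if column is None:
                    continue  # assumption: names unseen during fit are ignored
                if isinstance(sample, dict):
                    row[column] = float(sample[name])  # named numerical value
                else:
                    row[column] = 1.0  # a set only records occurrence
            rows.append(row)
        return rows


occurrences = [{"feature_1", "feature_3"}, {"feature_2"}]
vec = NamedFeatureVectorizer().fit(occurrences)
print(vec.vocabulary)               # {'feature_1': 0, 'feature_2': 1, 'feature_3': 2}
print(vec.transform(occurrences))   # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

valued = [{"feature_1": 0.1, "feature_3": 42.0}, {"feature_2": 0.4}]
print(NamedFeatureVectorizer().fit(valued).transform(valued))
# [[0.1, 0.0, 42.0], [0.0, 0.4, 0.0]]
```

A real implementation would presumably emit a scipy.sparse matrix rather than nested lists, since most entries are zero.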
We could also have a "smart" vectorizer that could deal with a list of
arbitrarily nested Python dicts / sets / lists containing string and
float literals and automatically turn it into a CSR matrix
representation:
>>> data = [
...     {'feature_1': 'value_1', 'feature_3': 42.},
...     {'feature_2': 0.4},
... ]
>>> vec = SmartVectorizer().fit(data)
>>> vec.vocabulary
{'feature_1/value_1': 0, 'feature_2': 1, 'feature_3': 2}
>>> vec.transform(data).toarray()
array([[ 1. ,  0. , 42. ],
       [ 0. ,  0.4,  0. ]])
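The flattening step of such a "smart" vectorizer could look roughly like the sketch below. The function names `flatten`, `fit_vocabulary` and `to_csr` are hypothetical, and two conventions are assumptions: nested names are joined with "/", and a string value becomes an indicator feature with value 1.0. The output is the three plain lists (data, indices, indptr) of the CSR layout, not an actual sparse matrix object.

```python
def flatten(obj, prefix=""):
    """Yield (feature_name, value) pairs from nested dicts / sets / lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            # Dict keys extend the feature name, e.g. 'outer/inner'.
            yield from flatten(value, prefix + "/" + key if prefix else key)
    elif isinstance(obj, (set, frozenset, list, tuple)):
        for item in obj:
            yield from flatten(item, prefix)
    elif isinstance(obj, str):
        # A string value becomes its own indicator feature.
        yield (prefix + "/" + obj if prefix else obj), 1.0
    else:
        if not prefix:
            raise ValueError("a numeric value needs a feature name")
        yield prefix, float(obj)


def fit_vocabulary(samples):
    """Map every flattened feature name to a column index."""
    names = sorted({name for sample in samples for name, _ in flatten(sample)})
    return {name: i for i, name in enumerate(names)}


def to_csr(samples, vocabulary):
    """Build the CSR triplet (data, indices, indptr) as plain lists."""
    data, indices, indptr = [], [], [0]
    for sample in samples:
        entries = {}
        for name, value in flatten(sample):
            if name in vocabulary:  # assumption: unseen names are ignored
                column = vocabulary[name]
                entries[column] = entries.get(column, 0.0) + value
        for column in sorted(entries):
            indices.append(column)
            data.append(entries[column])
        indptr.append(len(indices))  # marks where the next row starts
    return data, indices, indptr


samples = [{"feature_1": "value_1", "feature_3": 42.0}, {"feature_2": 0.4}]
vocabulary = fit_vocabulary(samples)
print(vocabulary)  # {'feature_1/value_1': 0, 'feature_2': 1, 'feature_3': 2}
print(to_csr(samples, vocabulary))  # ([1.0, 42.0, 0.4], [0, 2, 1], [0, 2, 3])
```

The three lists are in the order accepted by scipy.sparse.csr_matrix((data, indices, indptr), shape=(n_samples, n_features)), so wrapping them in a real sparse matrix would be a one-liner.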
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general