On 17 March 2012 at 13:25, Conrad Lee <[email protected]> wrote:
> The Google Prediction API seems to do some of this automatic detection of
> whether a feature is categorical or numerical. For example, if at least one
> value of a feature is a string, then they treat that feature as categorical.
> I'd say that's pretty reasonable.
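
The rule quoted above (one string value is enough to make a feature categorical) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `infer_feature_types` and the sample data are made up here, and the code is not taken from the Google Prediction API or from scikit-learn.

```python
def infer_feature_types(samples):
    """Label a feature 'categorical' if any of its values is a string,
    otherwise 'numerical'. Hypothetical helper, for illustration only."""
    types = {}
    for sample in samples:
        for name, value in sample.items():
            if isinstance(value, str):
                # A single string value settles it for the whole feature.
                types[name] = "categorical"
            else:
                # Only default to numerical if nothing was decided yet.
                types.setdefault(name, "numerical")
    return types


# Made-up samples: "zone" is numeric in one sample and a string in another.
samples = [
    {"city": "Amsterdam", "temperature": 12.0, "zone": 3},
    {"city": "Dubai", "temperature": 33.0, "zone": "B"},
]
feature_types = infer_feature_types(samples)
print(feature_types)
```

Note that `zone` ends up categorical even though its first value is a number, because a later sample holds a string for it.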
Right. This is exactly what the proposed DictVectorizer [1] now does.
Which, btw., needs just a bit more reviewing before it can be pulled :)

> We could go further and count the number of unique values for each attribute
> and compare that with the total number of examples. If the number
> of examples >> number of unique values, then we could infer that it's
> categorical. However, this is not correct in all situations, so it's
> probably going too far, and I don't really recommend that.

I think this may easily go wrong when feature values are raw frequencies,
e.g. in document classification. Suppose a feature has been observed twice
in one sample, once in another, and nowhere else. It then takes only three
distinct values (0, 1 and 2) over the whole data set, so your heuristic
would consider it categorical.

> Have other people dealt with this problem of automatically inferring whether
> a feature is numeric or categorical? If users want this kind of stuff done
> automatically, a safer way to do it would be to make them use ARFF files or
> something of that nature. Does scikit-learn support ARFF files? In this
> file format, each feature is explicitly labeled as numeric, categorical,
> string, or date.

No, ARFF is not supported, but I did once hack up an ARFF loader based
on the one in SciPy. It's in my arff branch [2]. It's been a long time since
I looked at that code, but I think categorical feature handling was still
on the to-do list.

[1] https://github.com/scikit-learn/scikit-learn/pull/686
[2] https://github.com/larsmans/scikit-learn/tree/arff

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
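
To make the objection to the unique-value heuristic concrete, here is a minimal sketch. The function name `looks_categorical` and the threshold `ratio=10` are assumptions made up for this example; the proposal in the thread only says "number of examples >> number of unique values" and gives no specific cutoff.

```python
def looks_categorical(values, ratio=10):
    """Hypothetical heuristic: call a feature categorical when the number
    of examples is at least `ratio` times the number of unique values."""
    return len(values) >= ratio * len(set(values))


# Raw counts of one term over 100 documents: it occurs twice in one
# document, once in another, and nowhere else.
term_counts = [2, 1] + [0] * 98

# A feature with 100 distinct real values, for comparison.
measurements = [i * 0.5 for i in range(100)]

counts_flagged = looks_categorical(term_counts)
measurements_flagged = looks_categorical(measurements)
print(counts_flagged, measurements_flagged)
```

The term-count feature has only three unique values among 100 examples, so the heuristic flags it as categorical even though it is a frequency and therefore numeric.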
