On 17 March 2012 at 13:25, Conrad Lee <[email protected]> wrote:
> The google prediction API seems to do some of this automatic detection of
> whether a feature is categorical or numerical.  For example, if at least one
> value of a feature is a string, then they treat that feature as categorical.
>  I'd say that's pretty reasonable.

Right. This is exactly what the proposed DictVectorizer [1] now does.
It just needs a bit more reviewing before it can be pulled, btw. :)
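
To make the string-means-categorical rule concrete, here is a minimal
pure-Python sketch. It is not the code from the pull request [1]; the
function name vectorize_dicts and the "name=value" column naming are
assumptions made for illustration. String values become 0/1 indicator
columns, and everything else is passed through as a number.

```python
def vectorize_dicts(samples):
    """Turn a list of feature dicts into (column_names, rows).

    Hypothetical helper, not the scikit-learn implementation.
    """
    def expand(sample):
        out = {}
        for name, value in sample.items():
            if isinstance(value, str):
                # Categorical: one indicator column per (feature, value) pair.
                out["%s=%s" % (name, value)] = 1.0
            else:
                # Numerical: keep the value under the feature's own name.
                out[name] = float(value)
        return out

    expanded = [expand(s) for s in samples]
    names = sorted({n for e in expanded for n in e})
    rows = [[e.get(n, 0.0) for n in names] for e in expanded]
    return names, rows


samples = [
    {"city": "Amsterdam", "temperature": 12.0},
    {"city": "London", "temperature": 9.5},
]
names, rows = vectorize_dicts(samples)
```

Here "city" is expanded into one column per observed string value,
while "temperature" stays a single numeric column.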

> We could go further and count the number of unique values for each attribute
> and compare that with the total number of examples.  If the number of
> examples >> the number of unique values, then we could infer that it's
> categorical.  However, this is not correct in all situations, so it's
> probably going too far, and I don't really recommend that.

I think this may easily go wrong when feature values are raw
frequencies, e.g. in document classification. Suppose a feature has
been observed twice in one sample, once in another, and nowhere else;
then it would be considered categorical by your heuristic.
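
The failure case can be sketched in a few lines. The function name
looks_categorical and the ratio threshold of 10 are assumptions chosen
for illustration; the heuristic as quoted only says "number of examples
>> number of unique values".

```python
def looks_categorical(values, ratio=10):
    """Unique-value heuristic: flag a feature as categorical when the
    number of examples is at least `ratio` times the number of distinct
    values. Both the name and the threshold are hypothetical."""
    n_unique = len(set(values))
    return len(values) >= ratio * n_unique


# A term count over 1000 documents: seen twice in one document,
# once in another, and nowhere else.
term_counts = [2, 1] + [0] * 998

# A continuous measurement with a distinct value in every sample.
continuous = [0.5 * i for i in range(1000)]

term_is_categorical = looks_categorical(term_counts)
continuous_is_categorical = looks_categorical(continuous)
```

The term count has only three distinct values (0, 1, 2) across 1000
examples, so the heuristic labels it categorical even though it is a
frequency.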

> Have other people dealt with this problem of automatically inferring whether
> a feature is numeric or categorical?  If users want this kind of stuff done
> automatically, a safer way to do it would be to make them use arff files or
> something of that nature.  Does scikit-learn support arff files? In this
> file format, each feature is explicitly labeled as numeric, categorical,
> string, or date.

No, ARFF is not supported, but I did once hack up an ARFF loader based
on the one in SciPy. It's in my arff branch. [2] It's been a long time
since I looked at that code, but I think categorical feature handling
was still on the todo list.
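
For readers unfamiliar with the format, the sketch below shows how the
explicit type labels in an ARFF header can be read. It is neither the
SciPy loader nor the code in the arff branch [2]; the function name
read_arff_attribute_types is hypothetical, and the parser assumes simple
unquoted attribute names. Note that ARFF calls categorical attributes
"nominal" and lists their values in braces.

```python
import re

# Matches lines such as "@attribute temperature numeric".
ATTRIBUTE_RE = re.compile(r"@attribute\s+(\S+)\s+(.+)", re.IGNORECASE)


def read_arff_attribute_types(lines):
    """Map attribute name -> (type, nominal_values_or_None).

    Hypothetical helper; only reads the header, stops at @data.
    """
    types = {}
    for line in lines:
        line = line.strip()
        if line.lower().startswith("@data"):
            break
        match = ATTRIBUTE_RE.match(line)
        if not match:
            continue  # skip @relation, comments, and blank lines
        name, spec = match.group(1), match.group(2).strip()
        if spec.startswith("{"):
            values = [v.strip() for v in spec.strip("{}").split(",")]
            types[name] = ("nominal", values)
        else:
            # First token is the type; a date may carry a format string.
            types[name] = (spec.split()[0].lower(), None)
    return types


header = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute comment string
@attribute observed date "yyyy-MM-dd"
@data
sunny,85,'hot',2012-03-17
""".splitlines()

attribute_types = read_arff_attribute_types(header)
```

Because each attribute is declared up front, no guessing from the data
values is needed to tell numeric from categorical features.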

[1] https://github.com/scikit-learn/scikit-learn/pull/686
[2] https://github.com/larsmans/scikit-learn/tree/arff

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
