2012/9/4 Andreas Mueller <[email protected]>:
> On 09/04/2012 03:23 PM, Lars Buitinck wrote:
>> What did the input look like? chi2 expects frequencies, i.e. strictly
>> non-negative feature values.
>>
> The inputs were non-negative, but some >1.
That should be perfectly ok; chi2 is designed for the kind of output
that CountVectorizer produces.
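For example, something along these lines works fine on raw term counts
(a quick sketch, untested; the toy documents and labels are just for
illustration):
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.feature_selection import chi2
>>> docs = ["spam spam spam", "ham and eggs"]
>>> y = [1, 0]                        # one class label per document
>>> X_counts = CountVectorizer().fit_transform(docs)   # non-negative counts
>>> scores, pvalues = chi2(X_counts, y)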
I just ran some quick tests, and scipy.stats.chisquare returns a
probability of exactly zero when the differences between counts per
class are large:
>>> import numpy as np
>>> from scipy.stats import chisquare
>>> X = np.array([[0, 10000], [10000, 0]])
>>> observed = X
>>> Y = np.array([[0, 1], [1, 0]])    # one-hot class labels
>>> # expected counts under independence: class frequencies times
>>> # per-feature totals
>>> expected = np.dot(np.atleast_2d(Y.mean(axis=0)).T,
...                   np.atleast_2d(X.sum(axis=0)))
>>> chisquare(observed, expected)
(array([ 10000., 10000.]), array([ 0., 0.]))
It may not be pretty, but it is expected behavior: the p-value is simply
too small to represent in double precision. If this bothers you in your
own code, you can binarize your features with a cutoff before handing
them to chi2, or apply logarithmic scaling to your frequencies with
TfidfTransformer(sublinear_tf=True, use_idf=False)
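Roughly (untested sketch; X_counts and y stand for your count matrix and
class labels):
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.feature_selection import chi2
>>> # 1 + log(tf) scaling, no idf weighting; values stay non-negative
>>> tf = TfidfTransformer(sublinear_tf=True, use_idf=False)
>>> X_log = tf.fit_transform(X_counts)
>>> scores, pvalues = chi2(X_log, y)
Note that TfidfTransformer also applies l2 normalization by default,
which is harmless here since the values remain non-negative.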
As for the broader implications in scikit-learn, I only now noticed that
the feature selector classes use the p-value instead of the raw
statistic to decide which features are interesting. In the case of
chi2, the raw statistic will be more stable.
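If that's an issue for you, you can rank on the raw statistic yourself;
something like (untested, k is just an example):
>>> import numpy as np
>>> from sklearn.feature_selection import chi2
>>> scores, pvalues = chi2(X_counts, y)
>>> k = 100
>>> top_k = np.argsort(scores)[-k:]   # indices of the k largest chi2 statistics
>>> X_selected = X_counts[:, top_k]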
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam