The problem was not in your corpus, but in your test (to misquote
Cassius).

See my surprise and coincidence paper
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186>, which
shows why chi-squared tests are evil for this sort of application.  The
short answer is that they over-estimate how excited you should be by as
much as 300 orders of magnitude.
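
To make that concrete, here is a rough sketch (not code from the paper)
that compares Pearson's chi-squared with the log-likelihood ratio statistic
(G^2) that the paper recommends, on a 2x2 co-occurrence table.  The
function names and the example counts are made up for illustration; the
counts just mimic a pair seen together twice in a collection of about a
million items.

```python
import math


def xlogx(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy: N*ln(N) - sum(k*ln(k)), where N = sum(counts)
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    # k11 = both events together, k12/k21 = one without the other,
    # k22 = neither.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def pearson_chi2(k11, k12, k21, k22):
    # Classic Pearson chi-squared: sum of (observed - expected)^2 / expected
    n = k11 + k12 + k21 + k22
    r1, r2 = k11 + k12, k21 + k22
    c1, c2 = k11 + k21, k12 + k22
    cells = ((k11, r1, c1), (k12, r1, c2), (k21, r2, c1), (k22, r2, c2))
    total = 0.0
    for observed, r, c in cells:
        expected = r * c / n
        total += (observed - expected) ** 2 / expected
    return total


# Hypothetical counts: a rare pair that co-occurs twice in a huge collection.
rare_pair = (2, 1, 1, 1_000_000)
print("chi-squared:", pearson_chi2(*rare_pair))
print("G^2 (LLR): ", llr(*rare_pair))
```

With counts like these, the expected value of the first cell is far below
one, so the chi-squared score explodes while G^2 stays modest.  Both are
compared against the same reference distribution, so the gap in the
implied significance is much larger than the gap in the raw scores.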

On Wed, Aug 5, 2009 at 3:26 PM, Tanton Gibbs <[email protected]> wrote:

> The problem was that the collection
> was so large that ANY repeated connection looked statistically
> significant (I was using chi-squares).  I eventually had to apply a
> cutoff, but I wonder if there was a more elegant way to do it.  I
> realize this is not the same thing as the OP's question - hope you
> don't mind :)
>



-- 
Ted Dunning, CTO
DeepDyve
