I should add that for collocations, this almost never matters because a pair
of words can only occur less than expected if one of the words is very
common.

The only example in English that I know off-hand is the phrase "the the",
which does occur (generally due to typographical error), but because "the" is
so common, it occurs less often than expected.

Any word that co-occurs with a word less common than "the" will tend to have
a very low expected frequency.  As such, it is hard to have a non-zero
frequency that is less than expected.  Even zero occurrences is not a whole
lot less than the expected frequency unless you have a truly ginormous
corpus.
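
To make the arithmetic concrete, here is a rough sketch of the expected count
of a word pair under independence, count(a) * count(b) / N.  The corpus size
and word counts are made up for illustration, and the function name
expected_pair_count is hypothetical.

```python
def expected_pair_count(count_a, count_b, total_bigrams):
    """Expected number of 'a b' bigrams if a and b occurred independently."""
    return count_a * count_b / total_bigrams


# Hypothetical corpus of one million bigrams (invented counts, not real data).
N = 1_000_000

# A very common word followed by itself: the expected count is large, so an
# observed count can plausibly be non-zero and still fall well below it.
common_pair = expected_pair_count(60_000, 60_000, N)  # 3600.0

# The same common word followed by a rare word: the expected count is tiny,
# so even zero occurrences is not far below expectation.
rare_pair = expected_pair_count(60_000, 50, N)  # 3.0
```

With an expected count of 3, there is very little room for a pair to be
observed and still come in below expectation.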

For cluster labeling or classification features, however, it is
quite plausible for a feature to be less common in the cluster of interest
than in the rest of the corpus.  Because the cluster may be relatively
large, it is also quite plausible for this feature to have a non-zero count
and a pretty respectable LLR.
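
For reference, the signed root-LLR variant from my earlier message (quoted
below) might be sketched as follows.  This is only an illustration: it assumes
the usual 2x2 table layout (k11 = feature in cluster, k12 = other features in
cluster, k21 = feature outside, k22 = other features outside), it assumes both
row totals are non-zero, and the function names llr and root_llr are
hypothetical.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            k = observed[i][j]
            if k > 0:  # an empty cell contributes nothing (limit of k * ln k)
                expected = rows[i] * cols[j] / n
                total += k * math.log(k / expected)
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * total)


def root_llr(k11, k12, k21, k22):
    """sqrt(LLR), signed by whether k11 is above or below expectation."""
    score = math.sqrt(llr(k11, k12, k21, k22))
    # Sign follows k11/k1* - k21/k2*; assumes neither row total is zero.
    if k11 / (k11 + k12) < k21 / (k21 + k22):
        return -score
    return score


# Invented counts: the first feature is over-represented in the cluster,
# the second is under-represented.
over = root_llr(10, 90, 100, 9900)
under = root_llr(1, 99, 1000, 9000)
```

The raw LLR is large in both cases; only the sign of the root score tells you
which feature is characteristic of the cluster and which is conspicuously
absent from it.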


On Tue, Jan 12, 2010 at 11:20 AM, Ted Dunning <[email protected]> wrote:

>
> Raw LLR has a large value whenever there is an anomaly.  In this case,
> term2 is rare in the cluster and common outside and is thus an anomaly.
>
> One thing that I do is to use a variant of the LLR score:
>
>     rootLLR = signum(k11/k1* - k21/k2*) * sqrt(LLR)
>
> This score has two advantages over the basic LLR:
>
> a) it is positive where k11 is bigger than expected, negative where it is
> lower.  This resolves your current problem.
>
> b) if there is no difference it is asymptotically normally distributed.
> This allows people to talk about "number of standard deviations" which is a
> more common frame of reference than the chi^2 distribution.
>
>
>
> On Tue, Jan 12, 2010 at 4:49 AM, Shashikant Kore <[email protected]> wrote:
>
>> As I can see Term1 is rarer outside the cluster, but common in the
>> cluster (relatively speaking.) But, when I calculate LLR scores,
>> Term1's score (3569) is lower than that of Term2 (3622). This looks
>> counter-intuitive to me. Is it the case that LLR score is higher if
>> term is common outside the cluster and rare inside?  Can this be
>> "fixed"?
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
>


-- 
Ted Dunning, CTO
DeepDyve
