Hi,

I am looking at the LLR (log-likelihood ratio) scores for two terms in
a cluster, and they seem counter-intuitive to me.

The corpus contains 706,120 docs and the cluster contains 21,964 docs.

Term1 appears in 904 docs in the cluster and 1,144 docs outside the cluster.
Term2 appears in 36 docs in the cluster and 60,280 docs outside the cluster.

As far as I can see, Term1 is common inside the cluster and rare
outside it (relatively speaking), while Term2 is the opposite. But
when I calculate LLR scores, Term1's score (3569) is lower than
Term2's (3622). This looks counter-intuitive to me. Does LLR also give
a high score when a term is common outside the cluster and rare inside
it? Can this be "fixed"?
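
To make "relatively speaking" concrete, here is a minimal Python sketch
that turns the counts above into per-document rates inside and outside
the cluster. The helper name doc_rates is only illustrative and does not
come from any library.

```python
def doc_rates(in_cluster_hits, cluster_size, outside_hits, corpus_size):
    """Return (rate inside cluster, rate outside cluster) for one term."""
    outside_size = corpus_size - cluster_size
    return in_cluster_hits / cluster_size, outside_hits / outside_size


CORPUS_SIZE = 706120   # total docs, from the message above
CLUSTER_SIZE = 21964   # docs in the cluster, from the message above

term1_in, term1_out = doc_rates(904, CLUSTER_SIZE, 1144, CORPUS_SIZE)
term2_in, term2_out = doc_rates(36, CLUSTER_SIZE, 60280, CORPUS_SIZE)

# Term1 is over-represented in the cluster, Term2 is under-represented.
print(f"Term1: in={term1_in:.4%}  out={term1_out:.4%}")
print(f"Term2: in={term2_in:.4%}  out={term2_out:.4%}")
```

Both terms deviate strongly from what independence would predict, but in
opposite directions.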

The k11, k12, k21, k22 values for Term1 and Term2 are as follows, in
case you wish to reproduce the calculation.

Term1
k11     904     
k12     21060   
k21     1144    
k22     683012  

Term2
k11     36      
k12     21928   
k21     60280   
k22     623876  
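
For anyone who wants to reproduce the scores, here is a minimal Python
sketch. It assumes the usual entropy-based form of the G^2 statistic,
2 * (rowEntropy + colEntropy - matrixEntropy) on unnormalized counts,
which may differ in detail from the implementation that produced the
numbers above. The function names are illustrative only.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: docs in cluster with the term
    k12: docs in cluster without the term
    k21: docs outside cluster with the term
    k22: docs outside cluster without the term
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


term1_llr = llr(904, 21060, 1144, 683012)
term2_llr = llr(36, 21928, 60280, 623876)
print(round(term1_llr), round(term2_llr))
```

Note that this statistic only measures how far the table is from
independence. It does not depend on the direction of the deviation, so
swapping the "in cluster" and "outside cluster" rows gives the same score.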

Thanks,

--shashi
