Hi, I am looking at LLR scores for two terms in a cluster which seem non-intuitive to me.
The corpus size is 706,120 documents and the cluster size is 21,964. Term1 appears in 904 docs in the cluster and 1,144 docs outside the cluster. Term2 appears in 36 docs in the cluster and 60,280 docs outside the cluster.

As far as I can see, Term1 is rare outside the cluster but common inside it (relatively speaking). Yet when I calculate LLR scores, Term1's score (3569) is lower than Term2's (3622). This looks counter-intuitive to me. Is it the case that the LLR score is higher if a term is common outside the cluster and rare inside? Can this be "fixed"?

The k11, k12, k21, k22 values for Term1 and Term2 are as follows, if you wish to calculate:

        k11    k12      k21      k22
Term1   904    21060    1144     683012
Term2   36     21928    60280    623876

Thanks,
--shashi
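For anyone who wants to reproduce the two scores, here is a minimal Python sketch. It assumes the usual entropy-based (G-squared) form of the log-likelihood ratio over the 2x2 table; the post does not say which implementation produced the numbers, so the formula and the helper names (`x_log_x`, `entropy`, `llr`, `rates`) are illustrative assumptions. It also prints the in-cluster and out-of-cluster document rates, so the direction of each association is visible next to the score.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy: N*log(N) - sum(k*log(k)).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # Assumed G^2 form: 2 * (row entropy + column entropy - matrix entropy).
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def rates(k11, k12, k21, k22):
    # Fraction of docs containing the term, inside vs. outside the cluster.
    return k11 / (k11 + k12), k21 / (k21 + k22)


# Counts copied from the post.
terms = {
    "Term1": (904, 21060, 1144, 683012),
    "Term2": (36, 21928, 60280, 623876),
}

for name, k in terms.items():
    inside, outside = rates(*k)
    direction = "over" if inside > outside else "under"
    print(f"{name}: LLR={llr(*k):.1f} "
          f"inside={inside:.5f} outside={outside:.5f} "
          f"({direction}-represented in cluster)")
```

In this form the statistic is unchanged when the two rows of the table are swapped, so it measures how strongly a term is associated with the cluster but not in which direction. Comparing the two rates is one simple way to tell over-representation from under-representation.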
