Re: LLR Scoring question

Shashikant Kore Tue, 12 Jan 2010 05:57:18 -0800

I am not sure which values you asked for. Here are the entropy values as
calculated by the following class:


http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java?view=markup


Term1
rowEntropy      12226
columnEntropy   96057
matrixEntropy   110068
result          3569


Term2
rowEntropy      204240
columnEntropy   96031
matrixEntropy   302083
result          3622
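
For anyone who wants to check these numbers without the Java class at hand,
here is a minimal Python sketch. It assumes the score is computed as
2 * (matrixEntropy - rowEntropy - columnEntropy), with each entropy taken as
the unnormalized sum -k * ln(k / total). That decomposition is inferred from
the values listed above, not copied from LogLikelihood.java, and the function
names are my own.

```python
import math


def entropy(*counts):
    """Unnormalized entropy: -sum(k * ln(k / total)); zero counts add nothing."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table (assumed decomposition, see above)."""
    row_entropy = entropy(k11, k12) + entropy(k21, k22)
    column_entropy = entropy(k11, k21) + entropy(k12, k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (matrix_entropy - row_entropy - column_entropy)


# Counts taken from the original question quoted below.
term1 = llr(904, 21060, 1144, 683012)
term2 = llr(36, 21928, 60280, 623876)
print(round(term1), round(term2))
```

Under these assumptions the two scores should land close to the 3569 and 3622
reported above, with Term2 slightly ahead of Term1.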


--shashi

On Tue, Jan 12, 2010 at 6:41 PM, Robin Anil <[email protected]> wrote:
> I don't have my code here to verify the result. Can you show the calculation
> here, I mean the values of the logs etc.? That may give a better idea.
>
>
> On Tue, Jan 12, 2010 at 6:19 PM, Shashikant Kore <[email protected]> wrote:
>
>> Hi,
>>
>> I am looking at LLR scores for two terms in a cluster which seem
>> non-intuitive to me.
>>
>> The corpus size is 706,120 and the size of the cluster is 21,964.
>>
>> Term1 appears in 904 docs in the cluster and 1144 docs outside the
>> cluster.
>> Term2 appears in 36 docs in the cluster and 60280 docs outside the
>> cluster.
>>
>> As far as I can see, Term1 is rare outside the cluster but common
>> inside it (relatively speaking). Yet when I calculate LLR scores,
>> Term1's score (3569) is lower than Term2's (3622). This looks
>> counter-intuitive to me. Is it the case that the LLR score is higher
>> if a term is common outside the cluster and rare inside? Can this be
>> "fixed"?
>>
>> The k11, k12, k21, k22 values for Term1 and Term2 are as follows, if
>> you wish to calculate:
>>
>> Term1
>> k11     904
>> k12     21060
>> k21     1144
>> k22     683012
>>
>> Term2
>> k11     36
>> k12     21928
>> k21     60280
>> k22     623876
>>
>> Thanks,
>>
>> --shashi
>>
>
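
On the question quoted above: an LLR score of this kind measures how far the
2x2 table departs from independence, in either direction, so a term that is
strongly under-represented in the cluster can score as high as one that is
over-represented. A minimal Python sketch for telling the two cases apart
compares the in-cluster and out-of-cluster rates. The function name
rate_ratio is hypothetical and not part of Mahout.

```python
def rate_ratio(k11, k12, k21, k22):
    """In-cluster document rate divided by out-of-cluster document rate.

    k11: docs in the cluster with the term, k12: in the cluster without it,
    k21: docs outside the cluster with the term, k22: outside without it.
    """
    rate_in = k11 / (k11 + k12)
    rate_out = k21 / (k21 + k22)
    return rate_in / rate_out


# Counts taken from the original question.
ratio1 = rate_ratio(904, 21060, 1144, 683012)
ratio2 = rate_ratio(36, 21928, 60280, 623876)

# A ratio above 1 means the term is over-represented in the cluster,
# a ratio below 1 means it is under-represented.
print(ratio1 > 1, ratio2 > 1)
```

One possible way to "fix" the ranking, offered only as a suggestion, is to
keep the LLR score for terms whose ratio is above 1 and to negate or drop it
for the rest.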
