Not sure which values you asked for. Here are the entropy values as calculated in the following class:
http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java?view=markup

Term1
rowEntropy     12226
columnEntropy  96057
matrixEntropy  110068
result         3569

Term2
rowEntropy     204240
columnEntropy  96031
matrixEntropy  302083
result         3622

--shashi

On Tue, Jan 12, 2010 at 6:41 PM, Robin Anil <[email protected]> wrote:

> I don't have my code here to verify the result. Can you show the calculation
> here, I mean the values of the logs etc.? Maybe that will give a better idea.
>
>
> On Tue, Jan 12, 2010 at 6:19 PM, Shashikant Kore <[email protected]> wrote:
>
>> Hi,
>>
>> I am looking at LLR scores for two terms in a cluster which seem
>> non-intuitive to me.
>>
>> The corpus size is 706,120 and the size of the cluster is 21,964.
>>
>> Term1 appears in 904 docs in the cluster and 1144 docs outside the
>> cluster.
>> Term2 appears in 36 docs in the cluster and 60280 docs outside the
>> cluster.
>>
>> As I can see, Term1 is rarer outside the cluster, but common in the
>> cluster (relatively speaking). But when I calculate LLR scores,
>> Term1's score (3569) is lower than that of Term2 (3622). This looks
>> counter-intuitive to me. Is it the case that the LLR score is higher if
>> a term is common outside the cluster and rare inside? Can this be
>> "fixed"?
>>
>> The k11, k12, k21, k22 values for Term1 and Term2 are as follows if you
>> wish to calculate.
>>
>> Term1
>> k11 904
>> k12 21060
>> k21 1144
>> k22 683012
>>
>> Term2
>> k11 36
>> k12 21928
>> k21 60280
>> k22 623876
>>
>> Thanks,
>>
>> --shashi
>>
>
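
For anyone who wants to redo the arithmetic, here is a minimal sketch in Python that recomputes the figures quoted in this thread from the k11..k22 counts. It is not the Mahout Java code: the function names (entropy, llr, over_represented) are mine, and the formula is an assumption, namely that each "entropy" is the unnormalized sum -sum(k * ln(k / total)) and that the result is 2 * (matrixEntropy - rowEntropy - columnEntropy). That assumption does appear to agree with the numbers posted here.

```python
import math


def entropy(*counts):
    """Unnormalized entropy: -sum(k * ln(k / total)). Zero counts contribute 0."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)


def llr(k11, k12, k21, k22):
    """Return (rowEntropy, columnEntropy, matrixEntropy, result).

    Assumed layout: k11 = term in cluster, k12 = other docs in cluster,
    k21 = term outside cluster, k22 = other docs outside cluster.
    """
    row = entropy(k11, k12) + entropy(k21, k22)
    col = entropy(k11, k21) + entropy(k12, k22)
    mat = entropy(k11, k12, k21, k22)
    return row, col, mat, 2.0 * (mat - row - col)


def over_represented(k11, k12, k21, k22):
    """True if the term's rate inside the cluster exceeds its rate outside."""
    return k11 / (k11 + k12) > k21 / (k21 + k22)


term1 = (904, 21060, 1144, 683012)
term2 = (36, 21928, 60280, 623876)

for name, k in (("Term1", term1), ("Term2", term2)):
    row, col, mat, score = llr(*k)
    direction = "over" if over_represented(*k) else "under"
    print(f"{name}: row={row:.0f} col={col:.0f} matrix={mat:.0f} "
          f"result={score:.0f} ({direction}-represented in cluster)")
```

The score itself is two-sided: it grows with the strength of the association between term and cluster, whichever direction that association runs. A term that is unusually rare inside the cluster can therefore score as high as one that is unusually common there. The over_represented check is one simple, hypothetical way to tell the two cases apart, by comparing the in-cluster rate k11 / (k11 + k12) with the outside rate k21 / (k21 + k22).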
