Ted,

Thank you for the tip.

>
> rootLLR = signum(k11/k1* - k21/k2*) * sqrt(LLR)
>

I didn't get what k1* and k2* are. I used (k11+k12) and (k21+k22) in
the denominators. That gives the correct result.

--shashi

On Wed, Jan 13, 2010 at 12:50 AM, Ted Dunning <[email protected]> wrote:
> Raw LLR has a large value whenever there is an anomaly. In this case, term2
> is rare in the cluster and common outside and is thus an anomaly.
>
> One thing that I do is to use a variant of the LLR score:
>
> rootLLR = signum(k11/k1* - k21/k2*) * sqrt(LLR)
>
> This score has two advantages over the basic LLR:
>
> a) it is positive where k11 is bigger than expected, negative where it is
> lower. This resolves your current problem.
>
> b) if there is no difference it is asymptotically normally distributed.
> This allows people to talk about "number of standard deviations" which is a
> more common frame of reference than the chi^2 distribution.
>
>
> On Tue, Jan 12, 2010 at 4:49 AM, Shashikant Kore <[email protected]> wrote:
>
>> As I can see Term1 is rarer outside the cluster, but common in the
>> cluster (relatively speaking.) But, when I calculate LLR scores,
>> Term1's score (3569) is lower than that of Term2 (3622). This looks
>> counter-intuitive to me. Is it the case that LLR score is higher if
>> term is common outside the cluster and rare inside? Can this be
>> "fixed"?
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
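
For anyone following the thread, here is a minimal Python sketch of the signed root LLR quoted above. It reads k1* and k2* as the row sums (k11+k12) and (k21+k22), as described in the reply. The function names, the table layout (row 1 = inside the cluster, row 2 = outside, column 1 = the term occurs) and the sample counts are assumptions made for illustration, not anything taken from Mahout's code.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed layout (illustrative only):
      k11 = term occurs, inside cluster    k12 = term absent, inside cluster
      k21 = term occurs, outside cluster   k22 = term absent, outside cluster
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    # Empty cells contribute nothing, since x * log(x) -> 0 as x -> 0.
    total = sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * total)


def root_llr(k11, k12, k21, k22):
    """Signed square root of LLR: signum(k11/k1* - k21/k2*) * sqrt(LLR).

    k1* and k2* are taken to be the row sums; both rows must be non-empty.
    """
    score = math.sqrt(llr(k11, k12, k21, k22))
    if k11 / (k11 + k12) < k21 / (k21 + k22):
        return -score  # term is rarer inside the cluster than outside
    return score


# Hypothetical counts: the same imbalance, once favouring the cluster
# and once favouring the rest of the corpus.
print(root_llr(50, 50, 10, 90))  # positive: over-represented in the cluster
print(root_llr(10, 90, 50, 50))  # negative: under-represented in the cluster
```

With the sign attached, two terms with similar raw LLR values but opposite directions (as with Term1 and Term2 in the original question) end up on opposite sides of zero, so sorting by this score puts cluster-specific terms first.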
