Whoa.... No. It sounds like I have muddied things thoroughly. What I was saying is that there are times when tf.idf and LLR agree and times when they disagree. In my experience, most cases in the second category are ones where tf.idf is over-weighting coincidental cases or where both scores are producing poor results.
If a phrase or term is marked as good by LLR and is a prominent feature of the centroid, that is fine.

On Tue, Aug 11, 2009 at 10:54 AM, Shashikant Kore <[email protected]> wrote:

> On Tue, Aug 11, 2009 at 8:57 PM, Ted Dunning <[email protected]> wrote:
> > If you expand the LLR equation and look at which terms are big, you will see
> > k_11 * log(mumble) as an important term for many words. Usually, this is
> > about the same as tf.idf since mumble is about the same as the term
> > frequency. For a single document, tf.idf is a very close approximation of
> > LLR. With many documents, the situation can change dramatically, however,
> > and you can get cancellation effects that eliminate documents that would
> > otherwise have high tf.idf. These are generally the terms that lead to
> > over-fitting with methods like naive bayes and are often not such great
> > cluster descriptors.
> >
>
> Let me restate what I understood.
>
> If a phrase is identified as prominent phrase by LLR and it also
> happens to be the top-weighted feature in the centroid vector, it is
> not a good candidate for cluster label.
>
> Is this correct?

--
Ted Dunning, CTO
DeepDyve
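
For readers who want to see the "k_11 * log(mumble)" term concretely, here is a minimal sketch of the LLR score for a 2x2 contingency table, written in Python. The function names (x_log_x, entropy, llr, k11_term) and the reading of the counts are assumptions made for illustration, not code from this thread: k11 is taken as occurrences of the term inside the cluster, k12 as occurrences of the term elsewhere, k21 as occurrences of other terms inside the cluster, and k22 as occurrences of other terms elsewhere.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a list of counts: N ln N - sum(k ln k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 table of counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


def k11_term(k11, k12, k21, k22):
    """The k11 * log(...) piece of the expanded sum.

    The log argument is the observed k11 divided by the count expected
    under independence, (row total * column total) / N.
    """
    if k11 == 0:
        return 0.0
    n = k11 + k12 + k21 + k22
    expected = (k11 + k12) * (k11 + k21) / n
    return k11 * math.log(k11 / expected)


if __name__ == "__main__":
    # Hypothetical counts: a term seen 30 times in the cluster, 5 times elsewhere.
    print(llr(30, 5, 970, 8995))
    print(2.0 * k11_term(30, 5, 970, 8995))
```

The full score is twice the sum of four such terms, one per cell. When the term is rare outside the cluster the k11 piece dominates, which is the situation where LLR and tf.idf roughly agree; the other three cells are where the cancellation described above can come from.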
