On Tue, Aug 11, 2009 at 8:57 PM, Ted Dunning<[email protected]> wrote:
> If you expand the LLR equation and look at which terms are big, you will see
> k_11 * log(mumble) as an important term for many words. Usually, this is
> about the same as tf.idf since mumble is about the same as the term
> frequency. For a single document, tf.idf is a very close approximation of
> LLR. With many documents, the situation can change dramatically, however,
> and you can get cancellation effects that eliminate documents that would
> otherwise have high tf.idf. These are generally the terms that lead to
> over-fitting with methods like naive bayes and are often not such great
> cluster descriptors.
>
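To make the k_11 * log(mumble) term concrete, here is a minimal sketch of the LLR (G^2) score for a 2x2 contingency table, written in Python. The function names and the layout of the counts are my own assumptions for illustration (k11 = term in cluster, k12 = term outside cluster, k21 = other terms in cluster, k22 = other terms outside cluster); this is not Mahout's API.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of counts.

    Computes 2 * sum_ij k_ij * log(k_ij * N / (row_i * col_j)),
    skipping empty cells (their contribution is taken as 0).
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def k11_term(k11, k12, k21, k22):
    """The single k_11 * log(observed / expected) term of the sum above."""
    if k11 == 0:
        return 0.0
    n = k11 + k12 + k21 + k22
    expected = (k11 + k12) * (k11 + k21) / n
    return k11 * math.log(k11 / expected)
```

Here "mumble" is read as the ratio of the observed k11 to the count expected under independence, which is one possible reading of the quoted text. A table with no association, such as llr(10, 10, 10, 10), scores 0, and the other three terms of the sum are where any cancellation against the k_11 term would come from.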
Let me restate what I understood. If a phrase is identified as a prominent phrase by LLR and it also happens to be the top-weighted feature in the centroid vector, it is not a good candidate for a cluster label. Is this correct?

--shashi
