I was referring to the condition where a phrase is identified as good
by LLR and is also a prominent feature of the centroid.  But, as you
clarified, the LLR score alone is a good indicator for top labels.
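
For reference, here is a minimal sketch of the LLR score as I understand
it, assuming a 2x2 table of document counts per term: k11 = in cluster
with the term, k12 = outside the cluster with the term, k21 = in cluster
without the term, k22 = neither. The function names and the top_labels
helper are my own illustration, not taken from any library.

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def top_labels(term_counts, n):
    # term_counts: hypothetical dict of term -> (k11, k12, k21, k22).
    # Keeps only the n highest-scoring terms.
    ranked = sorted(term_counts, key=lambda t: llr(*term_counts[t]), reverse=True)
    return ranked[:n]
```

A term spread evenly inside and outside the cluster scores close to
zero, while a term concentrated in the cluster scores high.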

Thanks for the pointer to co-occurrence statistics. I will study some
literature on that.
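
As a first rough sketch of the co-occurrence idea, assuming each
document is just a list of terms: count how often pairs of labels share
a document, keep only pairs above a threshold to get a sparse graph,
and merge linked labels. The merge step here is single-link
agglomerative clustering cut at that threshold (equivalent to connected
components), not spectral clustering, and it uses raw counts rather
than a proper association score. All names and the min_count parameter
are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(docs, labels, min_count=2):
    # Count label pairs that appear in the same document and keep
    # only pairs seen at least min_count times (the sparse graph).
    labels = set(labels)
    counts = Counter()
    for doc in docs:
        present = sorted(labels & set(doc))
        counts.update(combinations(present, 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

def single_link_clusters(labels, edges):
    # Union-find over the kept edges: labels joined by any chain of
    # edges end up in the same group.
    parent = {t: t for t in labels}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for t in labels:
        groups.setdefault(find(t), set()).add(t)
    return sorted(sorted(g) for g in groups.values())

docs = [
    ["Toyota", "Corolla"],
    ["Toyota", "Corolla", "price"],
    ["GM", "Cadillac"],
    ["GM", "Cadillac"],
    ["Toyota", "GM"],
]
labels = ["Toyota", "Corolla", "GM", "Cadillac"]
edges = cooccurrence_edges(docs, labels, min_count=2)
clusters = single_link_clusters(labels, edges)
```

With this toy data the Toyota/GM pair occurs only once and is dropped,
so Toyota groups with Corolla and GM with Cadillac.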

--shashi

On Wed, Aug 12, 2009 at 11:23 PM, Ted Dunning<[email protected]> wrote:
> On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <[email protected]>wrote:
>
>>
>> Is this a necessary & sufficient  condition for a good cluster label?
>
>
> I am not entirely clear what "this" is.  My assertion is that a high LLR score
> is sufficient evidence to use the term or phrase.  I generally also limit
> the number of terms, taking only the highest scoring ones.  The phrase
> "necessary and sufficient" comes from a rigorous mathematical
> background that doesn't entirely apply here, where we are talking about
> heuristics like this.
>
>
>> On a different note, is there any way to identify relationships among
>> the top labels of the clusters? For example, if I have a cluster related
>> to automobiles, I may get the companies (GM, Ford, Toyota) along with
>> their popular models (Corolla, Cadillac) as top labels. How can I
>> figure out that Toyota and Corolla are strongly related?
>
>
> Look at the co-occurrence statistics of the terms themselves.  Use that to
> form a sparse graph.  Then do spectral clustering or agglomerative
> clustering on the graph.
>
> That will give you clusters of terms that will give you much of what you
> seek.  Of course, the fact that the terms are being used to describe the
> same cluster means that you have a good chance of just replicating the label
> sets for your clusters.
>
> --
> Ted Dunning, CTO
> DeepDyve
>
