Re: Clustering techniques, tips and tricks

Pallavi Palleti Tue, 05 Jan 2010 18:56:42 -0800

The clusters-i directories hold the output of each iteration i, and points is the folder where you have the final output data in a consumable format. For example, in FuzzyKMeans, the clusters-0 directory contains records of the form clusterId\tclusterVector as key/value pairs. These are consumed by the next iteration to read the centroids. The points directory, on the other hand, contains data as itemVector\tclusterProbabilities. This gives you each item and its cluster probabilities (p(cluster/item)).
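
To make the two record layouts concrete, here is a rough Python sketch of how they could be read. It is only an illustration under assumptions: it supposes the records have already been rendered as plain text lines, with a tab between key and value and each vector written as comma-separated numbers. That text encoding and all function names are hypothetical; the files on disk may well be stored in a different (binary) format.

```python
from typing import List, Tuple


def parse_vector(text: str) -> List[float]:
    """Parse a comma-separated list of numbers (assumed, hypothetical encoding)."""
    return [float(x) for x in text.split(",") if x.strip()]


def parse_cluster_line(line: str) -> Tuple[str, List[float]]:
    """Split a 'clusterId<TAB>clusterVector' record into id and centroid."""
    cluster_id, vector = line.rstrip("\n").split("\t", 1)
    return cluster_id, parse_vector(vector)


def parse_point_line(line: str) -> Tuple[List[float], List[float]]:
    """Split an 'itemVector<TAB>clusterProbabilities' record."""
    item, probs = line.rstrip("\n").split("\t", 1)
    return parse_vector(item), parse_vector(probs)


def most_likely_cluster(probs: List[float]) -> int:
    """Return the index of the cluster with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    # Made-up sample records, one from each kind of directory.
    cluster_id, centroid = parse_cluster_line("c0\t1.0,2.0")
    item, probs = parse_point_line("1.0,2.0\t0.2,0.7,0.1")
    print(cluster_id, centroid)
    print(item, probs, most_likely_cluster(probs))
```

Picking the highest probability turns the fuzzy assignment into a hard one, which is handy if you only want a single cluster per item.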


Thanks
Pallavi




Bogdan Vatkov wrote:

Is there a description of the output structure of the results? I also see
some folders, like points, which is used by the ClusterDumper, but I do not
know the technical details.
I would be interested in what kind of data is available as a result of the
clustering. Is it different when a different algorithm is used (kmeans,
canopy, dirichlet)?

I also have one more theoretical question: for the cluster with the
highest "points", I get a term, the third by weight, which at the same time
has word freq = 9 according to the Solr Dictionary (and according to my
knowledge of the corpora too), and this is for 23 000+ input docs. Is it
something with the kmeans algorithm? The rest of the terms and clusters seem
to be somehow OK, but that one really astonished me. I am almost sure it is
not a problem with the index-to-dictionary mapping like I had before ;) (but
that was a general problem then: I was using the wrong dictionary file).
I am running with convergence 0.5; is that OK?

Best regards,
Bogdan
