The clusters-i directories hold the output of each iteration, and points is the folder that holds the final output data in a consumable format. For example, in FuzzyKMeans, the clusters-0 directory contains records of the form clusterId\tclusterVector as key/value pairs. The next iteration consumes these to read the centroids. The points directory, by contrast, contains records of the form itemVector\tclusterProbabilities, which give you each item and its cluster probabilities, p(cluster/item).
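
To make the points record shape concrete, here is a rough sketch in Python of how one might read an itemVector\tclusterProbabilities record and pick the most likely cluster for the item. This is not Mahout code: it assumes the records have already been dumped to plain text, and the comma-separated encoding of both fields, as well as the names parse_point_line and most_likely_cluster, are made up for illustration.

```python
# Hypothetical text dump of one points record: "itemVector\tclusterProbabilities".
# The comma-separated encoding of both fields is an assumption for illustration,
# not Mahout's actual on-disk format.

def parse_point_line(line):
    """Split one tab-separated record into (item_vector, cluster_probabilities)."""
    item_field, prob_field = line.rstrip("\n").split("\t")
    item_vector = [float(x) for x in item_field.split(",")]
    cluster_probs = [float(p) for p in prob_field.split(",")]
    return item_vector, cluster_probs


def most_likely_cluster(cluster_probs):
    """Return the index of the cluster with the highest p(cluster/item)."""
    return max(range(len(cluster_probs)), key=lambda i: cluster_probs[i])


# A made-up record: a 3-dimensional item with probabilities for 3 clusters.
sample = "1.0,2.0,0.5\t0.1,0.7,0.2"
item, probs = parse_point_line(sample)
best = most_likely_cluster(probs)
```

With fuzzy clustering every item carries a probability for every cluster, so taking the maximum as above is only one way to turn the output into a hard assignment.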

Thanks
Pallavi



Bogdan Vatkov wrote:
Is there a description of the output structure of the results? I also see
some folders like points, which is used by the ClusterDumper, but I do not
know the technical details.
I would be interested to know what kind of data is available as a result of
the clustering. Is it different when a different algorithm is used (kmeans,
canopy, dirichlet)?

I also have one more theoretical question: for the cluster with the
highest "points", the third term by weight has a word freq of only 9
according to the Solr Dictionary (and according to my knowledge of the
corpora too), and this is for 23 000+ input docs. Is it something to do with
the kmeans algorithm? The rest of the terms and clusters seem to be somehow ok,
but that one really astonished me. I am almost sure it is not a problem with
the index-to-dictionary mapping like I had before ;) (but that was a general
problem then: I was using the wrong dictionary file).
I am running with convergence 0.5, is that ok?

Best regards,
Bogdan
