On Jun 16, 2009, at 11:43 PM, Shashikant Kore wrote:
> I had hacked the code to put labels for the vectors.
OK, so we've put a lot of this in place now with MAHOUT-65.
> Then I modified
> KMeans to output the document label, Cluster ID, and distance from the
> cluster.
Do you think there is a way to make this generic for all of the
clustering jobs? Seems like this would be handy to have in the new
Utils module I'm working on for MAHOUT-126 (committing today).
Care to throw up a patch as a starting point like you did for
MAHOUT-126?
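To make the idea concrete, here is a rough sketch of what a clusterer-independent version of that output might look like. It is only an illustration under stated assumptions: it uses plain double[] vectors and Euclidean distance instead of Mahout's own vector and distance classes, and every name in it (LabeledAssignments, Assignment, assign) is hypothetical, not an existing API. Any clustering job that ends up with a set of centroids could feed this kind of step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not Mahout code. Maps each labeled vector to its
// nearest centroid and records (label, cluster ID, distance).
public class LabeledAssignments {

    /** One output record: document label, nearest cluster ID, distance to its centroid. */
    public static final class Assignment {
        public final String label;
        public final int clusterId;
        public final double distance;

        Assignment(String label, int clusterId, double distance) {
            this.label = label;
            this.clusterId = clusterId;
            this.distance = distance;
        }

        @Override
        public String toString() {
            // Tab-separated so the output is easy to post-process.
            return label + "\t" + clusterId + "\t" + distance;
        }
    }

    /** Euclidean distance; assumes both arrays have the same length. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Assigns every labeled vector to the closest centroid (index = cluster ID). */
    public static List<Assignment> assign(Map<String, double[]> labeledVectors,
                                          double[][] centroids) {
        List<Assignment> result = new ArrayList<>();
        for (Map.Entry<String, double[]> doc : labeledVectors.entrySet()) {
            int best = -1;
            double bestDistance = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double d = euclidean(doc.getValue(), centroids[c]);
                if (d < bestDistance) {
                    bestDistance = d;
                    best = c;
                }
            }
            result.add(new Assignment(doc.getKey(), best, bestDistance));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up toy data, purely for illustration.
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("doc-a", new double[] {0.0, 0.0});
        docs.put("doc-b", new double[] {3.0, 4.0});
        double[][] centroids = { {0.0, 0.0}, {3.0, 0.0} };

        for (Assignment a : assign(docs, centroids)) {
            System.out.println(a);
        }
    }
}
```

Keeping the assignment step separate from the clustering algorithm itself is the point of the sketch: the same label/cluster/distance records could then be written out in whatever format a downstream tool wants.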
> Another utility takes this input and converts labels to the
> actual text files from which it is created. Then I do random checks
> manually for the documents in a cluster.
OK, so ad hoc. Definitely a reasonable thing to do at this point.
I wonder if we could hook into Carrot2 visualization tools at all.
They have some really nice tools and perhaps we can output our stuff
in a way that works for them. I imagine Weka does too. I suppose this
all gets back to supporting more common input/output formats.
Although, it seems the JSON (GSON) stuff is pretty powerful that way
too.
-Grant