Cluster text docs

Benson Margulies Thu, 17 Dec 2009 15:31:09 -0800

Gang,

What's the state of the world on clustering a raft of textual
documents? Are all the pieces in place to start from a directory of
flat text files, push through Lucene to get the vectors, keep labels
on the vectors to point back to the files, and run, say, k-means?


I've got enough data here that skimming off the top few unigrams might
also be advisable.
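
To make the question concrete, here is a minimal sketch of that pipeline
in plain Python, without Lucene or Mahout. Everything in it is an
assumption for illustration: the helper names load_docs, tfidf_vectors,
and kmeans are hypothetical, tokenization is naive whitespace splitting,
and the clustering is a cosine-similarity (spherical) variant of k-means
over sparse TF-IDF vectors rather than any particular library's
implementation.

```python
import math
import random
from collections import Counter
from pathlib import Path


def load_docs(directory):
    """Return (labels, token lists); each label is the source file path."""
    labels, docs = [], []
    for path in sorted(Path(directory).glob("*.txt")):
        labels.append(str(path))
        text = path.read_text(encoding="utf-8", errors="ignore")
        docs.append(text.lower().split())  # naive whitespace tokenizer
    return labels, docs


def tfidf_vectors(docs, drop_top=0):
    """Build unit-length sparse TF-IDF dicts, one per document.

    The drop_top most frequent unigrams across the corpus are skipped.
    """
    totals = Counter(tok for doc in docs for tok in doc)
    skipped = {tok for tok, _ in totals.most_common(drop_top)}
    df = Counter(tok for doc in docs for tok in set(doc) if tok not in skipped)
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(tok for tok in doc if tok not in skipped)
        # Smoothed IDF so terms present in every document keep a nonzero weight.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def dot(a, b):
    """Sparse dot product; iterate over the shorter vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=20, seed=0):
    """Spherical k-means: assign by highest cosine similarity to a centroid."""
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors
        ]
        for c in range(k):
            total = Counter()
            for v, a in zip(vectors, assignment):
                if a == c:
                    for t, w in v.items():
                        total[t] += w
            if not total:
                continue  # empty cluster: keep the previous centroid
            norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
            centroids[c] = {t: w / norm for t, w in total.items()}
    return assignment
```

Calling load_docs on a directory, passing the token lists to
tfidf_vectors with some drop_top value, and then zipping the labels with
the result of kmeans gives a (file path, cluster id) pair per document.
Keeping the vectors as sparse dicts is the point of the sketch: memory
grows with the number of distinct terms per document, not with the full
vocabulary size.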

I tried running this through Weka, and it ran out of virtual memory.

--benson
