Gang, what's the state of the world on clustering a raft of text documents? Are all the pieces in place to start from a directory of flat text files, push them through Lucene to get term vectors, keep labels on the vectors that point back to the files, and run, say, k-means?
I've got enough data here that skimming off the top few unigrams might also be advisable. I tried running this through Weka, and it ran out of virtual memory.
--benson
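
For concreteness, the pipeline asked about above might be sketched as follows. This is an illustration only, not Lucene, Mahout, or Weka code: a plain regex tokenizer stands in for Lucene analysis, the names `load_vectors`, `kmeans`, and `drop_top` are hypothetical, "skimming off the top few unigrams" is read here as dropping the most frequent terms, and the clustering is the cosine-similarity variant of k-means on unit-length vectors (sometimes called spherical k-means) rather than the Euclidean form.

```python
import math
import os
import random
import re
from collections import Counter


def load_vectors(directory, drop_top=0):
    """Return (labels, vectors): one sparse, unit-length term vector per file.

    labels[i] is the file name that vectors[i] was built from.
    drop_top (hypothetical parameter) removes the N most frequent unigrams.
    """
    labels, counts = [], []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as fh:
            tokens = re.findall(r"[a-z0-9]+", fh.read().lower())
        labels.append(name)  # the label points back to the source file
        counts.append(Counter(tokens))

    # Find the most frequent unigrams across the whole corpus.
    total = Counter()
    for c in counts:
        total.update(c)
    skimmed = {term for term, _ in total.most_common(drop_top)}

    vectors = []
    for c in counts:
        vec = {t: float(n) for t, n in c.items() if t not in skimmed}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return labels, vectors


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster unit-length sparse vectors; return one cluster index per vector."""
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to the centroid with the highest cosine similarity.
        assignment = [
            max(range(k), key=lambda j: dot(v, centroids[j])) for v in vectors
        ]
        # Recompute each centroid as the normalized sum of its members.
        for j in range(k):
            summed = {}
            for v, a in zip(vectors, assignment):
                if a == j:
                    for t, x in v.items():
                        summed[t] = summed.get(t, 0.0) + x
            if not summed:
                continue  # keep the old centroid if the cluster emptied out
            norm = math.sqrt(sum(x * x for x in summed.values())) or 1.0
            centroids[j] = {t: x / norm for t, x in summed.items()}
    return assignment
```

With a hypothetical directory `docs/`, usage would be `labels, vecs = load_vectors("docs/", drop_top=50)` followed by `clusters = kmeans(vecs, k=10)`; `zip(labels, clusters)` then maps each file name to its cluster. Storing each vector as a dict of nonzero terms keeps memory proportional to the distinct terms in a document rather than to the full vocabulary.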
