I've managed to get k-means clustering working, but I agree it would be very nice to have an end-to-end example that lets others get up to speed quickly. I think the largest holes are the vacuuming of a corpus of text into the Lucene index and the presentation of a human-readable display of the results. It might also be interesting to calculate and report some metrics, such as the F-measure (where we have a reference categorization) and a scatter score (where we don't).
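For the reference-categorization case the F-measure is just the harmonic mean of precision and recall over cluster/category pairs. Something along these lines would do (a rough sketch; the method name is made up, and how the counts are accumulated against the reference labels is left to the example driver):

    // F-measure for one cluster/category pair: harmonic mean of precision and recall.
    public static double fMeasure(int truePositives, int falsePositives, int falseNegatives) {
      if (truePositives == 0) {
        return 0.0;                        // no correct assignments at all
      }
      double precision = truePositives / (double) (truePositives + falsePositives);
      double recall = truePositives / (double) (truePositives + falseNegatives);
      return 2.0 * precision * recall / (precision + recall);
    }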
The existing LDA example would be a useful starting point. It slurps in the Reuters-21578 corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>, converts it to text, loads it into a Lucene index, extracts vectors from that index, and runs LDA on them. That example uses the Lucene benchmark utilities for the input-to-text conversion and the Lucene loading. The benchmark utilities code is readable but complex. It would be very nice to have a simple piece of code to handle the creation of the Lucene index that others can easily build upon and adapt to their own corpus; a rough sketch of what I mean follows below the quoted message.

On Sat, Jan 2, 2010 at 2:10 PM, Benson Margulies <[email protected]> wrote:
> As someone who tried, not hard enough, and failed, to assemble all
> these bits in a row, I can only say that the situation cries out for
> an end-to-end sample. I'd be willing to help lick it into shape to be
> checked-in as such. My idea is that it should set up to vacuum-cleaner
> up a corpus of text, push it through Lucene, pull it out as vectors,
> tickle the pig hadoop, and deliver actual doc paths arranged by
> cluster.
>
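Here is the sort of minimal indexer I have in mind, written against the Lucene 3.x API (class and field names like "path" and "content" are just placeholders). The detail that matters for the downstream vectorization is that the content field is indexed with term vectors enabled, otherwise there is nothing for the vector extraction to read:

    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SimpleCorpusIndexer {
      public static void main(String[] args) throws IOException {
        File corpusDir = new File(args[0]);   // directory of plain-text documents
        File indexDir = new File(args[1]);    // where the Lucene index is written

        IndexWriter writer = new IndexWriter(
            FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_30),
            true,                             // create a fresh index
            IndexWriter.MaxFieldLength.UNLIMITED);

        for (File f : corpusDir.listFiles()) {
          Document doc = new Document();
          // Keep the original path so cluster output can point back at real documents.
          doc.add(new Field("path", f.getPath(),
                            Field.Store.YES, Field.Index.NOT_ANALYZED));
          // Term vectors must be enabled here for the vector extraction step.
          doc.add(new Field("content", new FileReader(f), Field.TermVector.YES));
          writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
      }
    }

Someone adapting this to their own corpus would only need to change how the files are walked and read; everything downstream (vector extraction, k-means or LDA) stays the same.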
