OK, I just committed a change to ClusterDumper so that it can now use a dictionary of terms to print out the values in the cells of the centroid vector. It does this by default instead of using the vector.asFormatString() capability. If you want the old functionality, pass in -j.
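As a rough illustration of the idea (this is not the actual ClusterDumper code), a dictionary-based printout could be sketched in plain Java as below. The class name `CentroidTermPrinter`, the `format` method, the array-backed dictionary, and the sample terms and weights are all hypothetical and chosen only for the example; the real implementation works on Mahout vectors and may format its output differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: label the non-zero cells of a centroid with dictionary terms.
public class CentroidTermPrinter {

    // Formats each non-zero cell as "term:weight". Falls back to the numeric
    // cell index when the dictionary is missing or has no entry for that cell.
    public static String format(double[] centroid, String[] dictionary) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < centroid.length; i++) {
            if (centroid[i] == 0.0) {
                continue; // skip empty cells
            }
            boolean hasTerm = dictionary != null
                    && i < dictionary.length
                    && dictionary[i] != null;
            String label = hasTerm ? dictionary[i] : String.valueOf(i);
            parts.add(String.format(Locale.ROOT, "%s:%.3f", label, centroid[i]));
        }
        return "[" + String.join(", ", parts) + "]";
    }

    public static void main(String[] args) {
        // Made-up dictionary and centroid, purely for illustration.
        String[] dictionary = {"oil", "price", "trade", "wheat"};
        double[] centroid = {0.5, 0.0, 0.25, 1.0};
        System.out.println(format(centroid, dictionary));
    }
}
```

Passing `null` as the dictionary in this sketch prints cell indices instead of terms, which is only loosely analogous to falling back to the older raw-vector output.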
I also applied the same functionality to VectorDumper. Let me know if that helps.

On Jan 2, 2010, at 3:11 PM, Drew Farris wrote:

> I've managed to get k-means clustering working, but I agree it would be very
> nice to have an end-to-end example that would allow others to get up to
> speed quickly. I think the largest holes here are related to the vacuum of a
> corpus of text into the Lucene index and the presentation of a
> human-readable display of the results. It might be interesting to also
> calculate and include some metrics such as the F-measure (in cases where we
> have a reference categorization) and scatter score (in cases where we
> don't).
>
> The existing LDA example would be a useful starting point. It slurps
> in the Reuters-21578
> corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
> converts it to text, loads it into a Lucene index, extracts vectors from the
> lucene index and runs LDA upon them.
>
> This example uses the lucene benchmark utilities for the input to text
> conversion and lucene loading. The benchmark utilities code is readable but
> complex. It would be very nice to have a simple piece of code to handle the
> creation of the Lucene index that others can easily build upon to respond
> to their existing corpus.
>
> On Sat, Jan 2, 2010 at 2:10 PM, Benson Margulies <[email protected]>
> wrote:
>> As someone who tried, not hard enough, and failed, to assemble all
>> these bits in a row, I can only say that the situation cries out for
>> an end-to-end sample. I'd be willing to help lick it into shape to be
>> checked-in as such. My idea is that it should set up to vacuum-cleaner
>> up a corpus of text, push it through Lucene, pull it out as vectors,
>> tickle the pig hadoop, and deliver actual doc paths arranged by
>> cluster.
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
