As someone who tried (not hard enough) and failed to assemble all these bits in a row, I can only say that the situation cries out for an end-to-end sample, and I'd be willing to help lick one into shape so it can be checked in as such. My idea is that it should vacuum up a corpus of text, push it through Lucene, pull it out as vectors, run it through k-means on Hadoop, and deliver actual doc paths arranged by cluster.
On Sat, Jan 2, 2010 at 1:44 PM, Ted Dunning <[email protected]> wrote:

> Since k-means is a hard clustering, that term should appear in no more than
> 2 clusters, and even that is very unlikely. It is also very unlikely that the
> cluster explanation would return that term as a top term even if it appeared
> in just one cluster.
>
> This could be some confusion in turning the ids back into terms. It
> definitely does indicate serious problems.
>
> On Sat, Jan 2, 2010 at 10:27 AM, Bogdan Vatkov <[email protected]> wrote:
>
>> How is this even possible - for 23,000 docs and for a term which is
>> mentioned only 2 times, I have it as a top term in 9 clusters? I definitely
>> did something wrong; do you have an idea what that could be?
>
> --
> Ted Dunning, CTO
> DeepDyve
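Ted's point can be made concrete with a toy example. The sketch below is plain Python, not Mahout code, and everything in it (the corpus, the term ids, the off-by-one dictionary bug) is hypothetical. It shows why, under hard clustering, a term with document frequency 2 can occur in at most 2 clusters, and how a misaligned id-to-term dictionary can print the wrong word for a cluster's top term id:

```python
from collections import defaultdict

# Toy corpus: doc -> {term_id: term_frequency}.
# Term id 7 appears in only two docs (d0 and d2), so its df is 2.
docs = {
    "d0": {1: 3, 7: 1},
    "d1": {2: 5},
    "d2": {7: 1, 3: 2},
    "d3": {2: 1, 3: 4},
}

# Hard clustering: every doc belongs to exactly one cluster.
assignment = {"d0": 0, "d1": 1, "d2": 2, "d3": 1}

# For each term id, collect the set of clusters whose docs contain it.
clusters_containing = defaultdict(set)
for doc, terms in docs.items():
    for term_id in terms:
        clusters_containing[term_id].add(assignment[doc])

# A term with df = 2 can therefore occur in at most 2 clusters,
# let alone be a top term in 9 of them.
assert len(clusters_containing[7]) <= 2

# But a dictionary that is off by one (a classic id->term mapping bug)
# makes the printed labels lie about which term is actually on top.
dictionary = {1: "lucene", 2: "hadoop", 3: "mahout", 7: "rareterm"}
shifted = {tid + 1: word for tid, word in dictionary.items()}  # misaligned

print(shifted.get(3))  # prints "hadoop", though id 3 really means "mahout"
```

If the dictionary used by the cluster dumper is built or read in a different order than the one used to assign ids during vectorization, every top-term listing is shifted like this, which would explain a rare term seemingly showing up everywhere.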
