: The final "production" computation is one-time; still, I have to repeatedly : come back, correct some errors, and retry...
This doesn't really seem like a problem ideally suited for Lucene ... it seems like the type of problem sequential batch crunching could solve better:

first pass: tokenize each document into a bucket of words
second pass: count the occurrences of every word, and make a list of all words whose occurrence count is greater than N
third pass: filter the word buckets from pass #1 so they only contain words in the list produced by pass #2
fourth pass: generate all pairs of words in every word bucket produced by pass #3
fifth pass: sort and count the unique pairs produced by pass #4

...I have a hard time thinking in terms of Map/Reduce steps, but I'm guessing a Hadoop-based app could do all this in a relatively straightforward manner (see the sketch below).

-Hoss
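For what it's worth, here is a minimal sequential sketch of those five passes in plain Java (not Hadoop). The class name, the tokenizer (lowercase whitespace split), the threshold N, and the sample documents are all illustrative assumptions, not anything specified in the thread.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical standalone sketch of the five-pass pipeline described above.
public class CooccurrenceCounter {

    static final int N = 1; // assumed frequency threshold

    public static void main(String[] args) {
        List<String> docs = List.of(
                "the quick brown fox",
                "the lazy brown dog",
                "the quick red fox");

        // pass 1: tokenize each document into a bucket of words
        List<List<String>> buckets = new ArrayList<>();
        for (String doc : docs) {
            buckets.add(List.of(doc.toLowerCase().split("\\s+")));
        }

        // pass 2: count occurrences of every word; keep words occurring > N times
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> bucket : buckets) {
            for (String w : bucket) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        Set<String> frequent = new HashSet<>();
        counts.forEach((w, c) -> { if (c > N) frequent.add(w); });

        // pass 3: filter each bucket down to the frequent words
        List<List<String>> filtered = new ArrayList<>();
        for (List<String> bucket : buckets) {
            List<String> kept = new ArrayList<>();
            for (String w : bucket) {
                if (frequent.contains(w)) kept.add(w);
            }
            filtered.add(kept);
        }

        // passes 4 and 5: generate every word pair within a bucket,
        // canonicalize order so (a,b) == (b,a), then sort and count
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (List<String> bucket : filtered) {
            for (int i = 0; i < bucket.size(); i++) {
                for (int j = i + 1; j < bucket.size(); j++) {
                    String a = bucket.get(i), b = bucket.get(j);
                    String key = a.compareTo(b) <= 0 ? a + "," + b : b + "," + a;
                    pairCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        pairCounts.forEach((pair, c) -> System.out.println(pair + "\t" + c));
    }
}

In MapReduce terms, passes 1 and 2 look roughly like the classic word-count job, and passes 3 through 5 like a second job whose mapper emits canonicalized word pairs from each filtered bucket and whose reducer sums the counts per pair.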