: The final "production" computation is one-time, still, I have to recurrently
: come back and correct some errors, then retry...

this doesn't really seem like a problem ideally suited for Lucene ... it 
seems like the type of problem sequential batch crunching could solve 
better...

first pass: tokenize each document into a bucket of words
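
in plain java, pass #1 might look something like this (just a sketch; the 
class name and the naive whitespace/punctuation tokenizer are my own 
stand-ins -- a real version would probably reuse a Lucene Analyzer) ...

import java.util.ArrayList;
import java.util.List;

public class Pass1Tokenize {
  // turn one document's raw text into its "bucket" of words
  static List<String> tokenize(String docText) {
    List<String> bucket = new ArrayList<>();
    for (String tok : docText.toLowerCase().split("\\W+")) {
      if (!tok.isEmpty()) bucket.add(tok);
    }
    return bucket;
  }
}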

second pass: count the occurrences of every word, and make a list of every 
word whose count is greater than N.
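
...roughly like this (again a sketch; the threshold n and the shape of the 
input are assumptions on my part) ...

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Pass2Count {
  // tally every word across all buckets, keep those with count > n
  static Set<String> frequentWords(List<List<String>> buckets, int n) {
    Map<String,Integer> counts = new HashMap<>();
    for (List<String> bucket : buckets)
      for (String w : bucket)
        counts.merge(w, 1, Integer::sum);
    Set<String> keep = new HashSet<>();
    for (Map.Entry<String,Integer> e : counts.entrySet())
      if (e.getValue() > n) keep.add(e.getKey());
    return keep;
  }
}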

third pass: filter the word buckets from pass#1 so they only contain 
words in the list produced by pass#2
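
...the filter itself is trivial (sketch; names are mine) ...

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class Pass3Filter {
  // keep only the words that made the pass #2 list
  static List<String> filterBucket(List<String> bucket, Set<String> keep) {
    List<String> out = new ArrayList<>();
    for (String w : bucket)
      if (keep.contains(w)) out.add(w);
    return out;
  }
}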

fourth pass: generate all pairs of words in every word bucket produced 
by pass#3
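
...something like this (sketch; i'm assuming unordered pairs, so the two 
words are sorted within each pair to keep (a,b) and (b,a) from being 
counted separately -- the description above doesn't say) ...

import java.util.ArrayList;
import java.util.List;

public class Pass4Pairs {
  // emit every unordered word pair in a bucket, tab separated
  static List<String> pairs(List<String> bucket) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < bucket.size(); i++) {
      for (int j = i + 1; j < bucket.size(); j++) {
        String a = bucket.get(i), b = bucket.get(j);
        out.add(a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a);
      }
    }
    return out;
  }
}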

fifth pass: sort and count the unique pairs produced by pass#4
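
...which is just the in-memory equivalent of "sort | uniq -c" over the 
pass #4 output (sketch) ...

import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class Pass5SortCount {
  // the sorted map gives us the "sort", merge gives us the count
  static SortedMap<String,Integer> countPairs(List<String> allPairs) {
    SortedMap<String,Integer> counts = new TreeMap<>();
    for (String p : allPairs)
      counts.merge(p, 1, Integer::sum);
    return counts;
  }
}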


...i have a hard time thinking in terms of Map/Reduce steps, but i'm 
guessing a Hadoop-based app could do all this in a relatively 
straightforward manner.
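
for what it's worth, here's a rough guess at how passes #4 and #5 might 
collapse into a single Map/Reduce step with the stock Hadoop API -- 
untested, the class names are mine, and i'm assuming the input is one 
filtered word bucket per line:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairCount {

  // map: one filtered bucket per input line -> emit (pair, 1) for every
  // unordered word pair in that bucket
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] words = value.toString().split("\\s+");
      for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) {
          String a = words[i], b = words[j];
          pair.set(a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a);
          ctx.write(pair, ONE);
        }
      }
    }
  }

  // reduce: sum the 1s for each pair
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}

...the nice part being that the shuffle between map and reduce does the 
sorting for free, which is all pass #5 really wanted from "sort".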



-Hoss

