Michael McCandless wrote:
> Is this a one-time computation?  If so, couldn't you wait a long time
> for the machine to simply finish it?
The final "production" computation is one-time, still, I have to recurrently come back and correct some errors, then retry...

> With the simple approach (doing 100 million 2-term AND queries), how
> long do you estimate it'd take?
About the estimated time: my existing index is really problematic and I should look for ways to optimize it, but I really think analyzer-time frequencies should do the job.
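
For reference, here is a minimal sketch of the brute-force approach, assuming a recent Lucene API (BooleanQuery.Builder and IndexSearcher.count); the index path, the "body" field name, and the two terms are placeholders:

    import java.nio.file.Paths;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;

    public class PairCount {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Count documents containing BOTH terms; count() skips
                // scoring, so it is cheaper than a full search.
                BooleanQuery query = new BooleanQuery.Builder()
                        .add(new TermQuery(new Term("body", "lucene")), Occur.MUST)
                        .add(new TermQuery(new Term("body", "index")), Occur.MUST)
                        .build();
                System.out.println(searcher.count(query) + " docs match both terms");
            }
        }
    }

Even with count() skipping scoring, 100 million such queries means 100 million posting-list intersections, which is why the analyzer-time approach looks more attractive.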
> I think you could do this with your own analyzer (as you
> suggested)... it would run normal tokenization, gather all unique
> terms that occurred, discard the "noise" terms (odd to me that you
> don't consider stop words as noise -- or maybe you mean noise
> (non-salient terms) at the bigram level?)
By noise I mainly meant terms with frequency 1 (misspelled words, garbage escaping my Analyzer). In my current attempts I am really interested in common words.
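
A rough sketch of that analyzer-time counting, under my assumptions (any Lucene Analyzer, Java 8; the "body" field name is arbitrary): run each document's text through the analysis chain, count adjacent term pairs in a map, then prune the frequency-1 entries:

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class BigramCounter {
        private final Map<String, Long> counts = new HashMap<>();

        // Tokenize one document and count each adjacent term pair.
        public void add(Analyzer analyzer, String text) throws IOException {
            try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                String prev = null;
                while (ts.incrementToken()) {
                    String cur = term.toString();
                    if (prev != null) {
                        counts.merge(prev + " " + cur, 1L, Long::sum);
                    }
                    prev = cur;
                }
                ts.end();
            }
        }

        // Discard frequency-1 "noise" pairs (misspellings, tokenizer garbage).
        public void pruneSingletons() {
            counts.values().removeIf(c -> c == 1L);
        }

        public Map<String, Long> counts() { return counts; }
    }

This is a single pass over the collection instead of 100 million index lookups; the map can get large, so in practice I may have to spill partial counts to disk and merge them.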

Thanks for the advice,
Adrian.
