Michael McCandless wrote:
> Is this a one-time computation?  If so, couldn't you wait a long time
> for the machine to simply finish it?
The final "production" computation is one-time, still, I have to recurrently come back and correct some errors, then retry...

> With the simple approach (doing 100 million 2-term AND queries), how
> long do you estimate it'd take?
About the estimated time: my existing index is really problematic and I should look for ways to optimize it, but I really think analyzer-time frequencies should do the job.
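
For reference, here is a minimal sketch of the brute-force approach, assuming a recent Lucene API (BooleanQuery.Builder and IndexSearcher.count); the index path, the "body" field name, and the two terms are placeholders:

    import java.nio.file.Paths;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;

    public class PairCount {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Count documents containing BOTH terms; count() skips
                // scoring, so it is cheaper than a full search.
                BooleanQuery query = new BooleanQuery.Builder()
                        .add(new TermQuery(new Term("body", "lucene")), Occur.MUST)
                        .add(new TermQuery(new Term("body", "index")), Occur.MUST)
                        .build();
                System.out.println(searcher.count(query) + " docs match both terms");
            }
        }
    }

Even with count() skipping scoring, 100 million such queries means 100 million posting-list intersections, which is why the analyzer-time approach looks more attractive.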
> I think you could do this with your own analyzer (as you
> suggested)... it would run normal tokenization, gather all unique
> terms that occurred, discard the "noise" terms (odd to me that you
> don't consider stop words as noise -- or maybe you mean noise
> (non-salient terms) at the bigram level?)
By noise I mainly meant terms with frequency 1 (misspelled words, garbage escaping my Analyzer). In my current attempts I am really interested in common words.
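
A rough sketch of that analyzer-time counting, under my assumptions (any Lucene Analyzer, Java 8; the "body" field name is arbitrary): run each document's text through the analysis chain, count adjacent term pairs in a map, then prune the frequency-1 entries:

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class BigramCounter {
        private final Map<String, Long> counts = new HashMap<>();

        // Tokenize one document and count each adjacent term pair.
        public void add(Analyzer analyzer, String text) throws IOException {
            try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                String prev = null;
                while (ts.incrementToken()) {
                    String cur = term.toString();
                    if (prev != null) {
                        counts.merge(prev + " " + cur, 1L, Long::sum);
                    }
                    prev = cur;
                }
                ts.end();
            }
        }

        // Discard frequency-1 "noise" pairs (misspellings, tokenizer garbage).
        public void pruneSingletons() {
            counts.values().removeIf(c -> c == 1L);
        }

        public Map<String, Long> counts() { return counts; }
    }

This is a single pass over the collection instead of 100 million index lookups; the map can get large, so in practice I may have to spill partial counts to disk and merge them.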

Thanks for the advice,
Adrian.
