Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Thank you. I suppose the solution for this is to not create an index but to store co-occurrence frequencies at Analyzer level. Adrian. On Mon, Mar 16, 2009 at 11:37 AM, Michael McCandless luc...@mikemccandless.com wrote: Be careful: docFreq does not take deletions into account.

Re: number of hits of pages containing two terms

2009-03-17 Thread Michael McCandless
Adrian Dimulescu wrote: Thank you. I suppose the solution for this is to not create an index but to store co-occurrence frequencies at Analyzer level. I don't understand how this would address the docFreq does not reflect deletions. You can use the shingles analyzer (under

Re: number of hits of pages containing two terms

2009-03-17 Thread Ian Lea
This is all getting very complicated! Adrian - have you looked any further into why your original two term query was too slow? My experience is that simple queries are usually extremely fast. Standard questions: have you warmed up the searcher? How large is the index? How many occurrences of

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Michael McCandless wrote: I don't understand how this would address the docFreq does not reflect deletions. Bad mail-quoting, sorry. I am not interested in document deletion; I just index Wikipedia once, and want to get a co-occurrence-based similarity distance between words called NGD
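
[Editor's sketch, not part of the original message.] Assuming "NGD" here refers to the Normalized Google Distance, the distance can be computed from three document counts plus the collection size: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) and f(y) are the numbers of pages containing each term, f(x, y) is the number of pages containing both, and N is the total number of pages. The class and method names below are hypothetical and chosen only for illustration; this is why the thread needs the two-term hit count and nothing else from the search.

```java
// Hypothetical helper: a minimal sketch of the Normalized Google Distance,
// assuming that is what "NGD" means in the message above.
final class Ngd {
    private Ngd() {}

    /**
     * @param fx  number of documents containing term x (must be > 0)
     * @param fy  number of documents containing term y (must be > 0)
     * @param fxy number of documents containing both terms
     * @param n   total number of documents in the collection (must be > 0)
     * @return the distance; 0 for terms that always co-occur,
     *         positive infinity for terms that never co-occur
     */
    static double distance(long fx, long fy, long fxy, long n) {
        if (fx <= 0 || fy <= 0 || n <= 0) {
            throw new IllegalArgumentException("fx, fy and n must be positive");
        }
        if (fxy <= 0) {
            // log(0) is undefined; terms that never co-occur are maximally distant.
            return Double.POSITIVE_INFINITY;
        }
        double logX = Math.log(fx);
        double logY = Math.log(fy);
        double numerator = Math.max(logX, logY) - Math.log(fxy);
        double denominator = Math.log(n) - Math.min(logX, logY);
        if (denominator == 0.0) {
            // The rarer term occurs in every document, so both terms do:
            // treat them as always co-occurring.
            return 0.0;
        }
        return numerator / denominator;
    }
}
```

With this shape, the only expensive input is fxy, the count of pages containing both terms; fx and fy are single-term document frequencies.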

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Ian Lea wrote: Adrian - have you looked any further into why your original two term query was too slow? My experience is that simple queries are usually extremely fast. Let me first point out that it is not too slow in absolute terms, it is only for my particular needs of attempting the

Re: number of hits of pages containing two terms

2009-03-17 Thread Ian Lea
OK - thanks for the explanation. So this is not just a simple search ... I'll go away and leave you and Michael and the other experts to talk about clever solutions. -- Ian. On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu adrian.dimule...@gmail.com wrote: Ian Lea wrote: Adrian - have

Re: number of hits of pages containing two terms

2009-03-17 Thread Michael McCandless
Is this a one-time computation? If so, couldn't you wait a long time for the machine to simply finish it? With the simple approach (doing 100 million 2-term AND queries), how long do you estimate it'd take? I think you could do this with your own analyzer (as you suggested)... it would run

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Michael McCandless wrote: Is this a one-time computation? If so, couldn't you wait a long time for the machine to simply finish it? The final production computation is one-time; still, I have to repeatedly come back and correct some errors, then retry... With the simple approach (doing 100

Re: number of hits of pages containing two terms

2009-03-17 Thread Paul Elschot
You may want to try Filters (starting from TermFilter) for this, especially those based on the default OpenBitSet (see the intersection count method) because of your interest in stop words. 10k OpenBitSets for 39 M docs will probably not fit in memory in one go, but that can be worked around by
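
[Editor's sketch, not part of the original message.] The idea of counting co-occurrences by intersecting per-term bit sets can be illustrated without Lucene. The sketch below uses java.util.BitSet as a stand-in for Lucene's OpenBitSet, with one bit per document id; the class and method names are hypothetical. It only counts the shared bits and never materializes the matching documents.

```java
import java.util.BitSet;

// Hypothetical helper: counts documents containing both terms, given one
// bit set per term in which bit i is set when document i contains the term.
// java.util.BitSet is used here as a stand-in for Lucene's OpenBitSet.
final class CoOccurrenceCount {
    private CoOccurrenceCount() {}

    /** Returns the number of document ids set in both bit sets. */
    static int intersectionCount(BitSet docsWithFirst, BitSet docsWithSecond) {
        // Work on a copy so the caller's (possibly cached) bit set is not modified.
        BitSet both = (BitSet) docsWithFirst.clone();
        both.and(docsWithSecond);
        return both.cardinality();
    }

    /** Builds a bit set from a list of document ids (illustration only). */
    static BitSet fromDocIds(int... docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }
}
```

On the memory concern: one uncompressed bit per document for 39 M documents is about 4.9 MB per term, so 10k such sets would need roughly 49 GB, which is consistent with the remark that they will not fit in memory in one go.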

Re: number of hits of pages containing two terms

2009-03-17 Thread Chris Hostetter
: The final production computation is one-time, still, I have to recurrently : come back and correct some errors, then retry... this doesn't really seem like a problem ideally suited for Lucene ... this seems like the type of problem sequential batch crunching could solve better... first

number of hits of pages containing two terms

2009-03-16 Thread Adrian Dimulescu
Hello, I need the number of pages that contain two terms. Only the number of hits, I don't care about retrieving the pages. Right now I am using the following code in order to get it: Term first, second; TermQuery q1 = new TermQuery(first); TermQuery q2 = new TermQuery(second);

Re: number of hits of pages containing two terms

2009-03-16 Thread Michael McCandless
Adrian Dimulescu wrote: Hello, I need the number of pages that contain two terms. Only the number of hits, I don't care about retrieving the pages. Right now I am using the following code in order to get it: Term first, second; TermQuery q1 = new TermQuery(first); TermQuery q2 = new