Re: number of hits of pages containing two terms

Paul Elschot Tue, 17 Mar 2009 10:04:17 -0700

You may want to try Filters (starting from TermFilter) for this, especially
those based on the default OpenBitSet (see its intersectionCount method),
because of your interest in stop words.
10k OpenBitSets for 39M docs will probably not fit in memory in one go,
but that can be worked around by keeping fewer of them in memory at a time.
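
As a rough illustration of the intersection count idea, here is a minimal
sketch. It uses java.util.BitSet as a stand-in for Lucene's OpenBitSet so that
it runs without Lucene on the classpath; the class name BitSetCooccurrence and
the helper bytesPerBitSet are hypothetical names introduced only for this
example. With Lucene itself, the static OpenBitSet.intersectionCount(a, b)
plays the same role.

```java
import java.util.BitSet;

// Assumption: java.util.BitSet stands in for Lucene's OpenBitSet.
// One bit per document; bit i is set when document i contains the term.
class BitSetCooccurrence {

    // Counts the documents present in both sets without building
    // the intersection as a third bit set.
    static long intersectionCount(BitSet a, BitSet b) {
        // toLongArray() copies the backing words, which is acceptable
        // for a sketch but not what a tuned implementation would do.
        long[] wordsA = a.toLongArray();
        long[] wordsB = b.toLongArray();
        int n = Math.min(wordsA.length, wordsB.length);
        long count = 0;
        for (int i = 0; i < n; i++) {
            // AND the words and count the surviving bits.
            count += Long.bitCount(wordsA[i] & wordsB[i]);
        }
        return count;
    }

    // Approximate size in bytes of one fixed-size bit set over maxDoc documents.
    static long bytesPerBitSet(long maxDoc) {
        return (maxDoc + 7) / 8;
    }
}
```

For 39M documents, bytesPerBitSet gives a little under 5 MB per term, so 10k
such sets would need on the order of 49 GB, which is why only a subset can be
kept in memory at a time.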


For non-stop words, you could also try using SortedVIntList instead
of OpenBitSet to reduce memory usage. In that case there is no direct
intersection count, but a counting iteration over the intersection can
still be done without actually forming the resulting filter.
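
A counting iteration of this kind can be sketched as a two-pointer walk over
two sorted lists of document ids. This is only an illustration under stated
assumptions: plain int[] arrays stand in for the iterators over two
SortedVIntLists, and the class name SortedListCooccurrence is hypothetical.

```java
// Assumption: each int[] holds the doc ids for one term in strictly
// ascending order, standing in for iteration over a SortedVIntList.
class SortedListCooccurrence {

    // Counts doc ids present in both lists without storing the intersection.
    static int intersectionCount(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        int count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;            // a is behind, advance it
            } else if (a[i] > b[j]) {
                j++;            // b is behind, advance it
            } else {
                count++;        // same doc id in both lists
                i++;
                j++;
            }
        }
        return count;
    }
}
```

The walk touches each list entry at most once, so the cost is proportional to
the sum of the two list lengths rather than to the total number of documents.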

Regards,
Paul Elschot


On Tuesday 17 March 2009 12:35:19 Adrian Dimulescu wrote:
> Ian Lea wrote:
> > Adrian - have you looked any further into why your original two term
> > query was too slow?  My experience is that simple queries are usually
> > extremely fast.  
> Let me first point out that it is not "too slow" in absolute terms; it 
> is only too slow for my particular need, which is to compute the number of 
> co-occurrences between, ideally, all non-noise terms (I plan about 10k x 
> 10k = 100 million calculations).
> > How large is the index?
> I indexed Wikipedia (the 8 GB XML dump you can download). The index size 
> is 4.4 GB. I have 39 million documents. The particularity is that I cut 
> Wikipedia into paragraphs and I consider each paragraph as a Document (not 
> one page per Document as usual), which makes a lot of short documents. 
> Each document has a stored id and a non-stored analyzed body:
> 
>             doc.add(new Field("id", id, Store.YES, Index.NO));
>             doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
> 
> > How many occurrences of your first or second
> > terms?  
> I do have in my index some words that are usually qualified as "stop" 
> words. My first two terms are "and" : 13M hits and "s" : 4M hits. I use 
> the SnowballAnalyzer in order to lemmatize words.
> 
> My intuition is that the large number of short documents and the fact I 
> am interested in the "stop" words do not help performance.
> 
> Thank you,
> Adrian.
> 
