OK - thanks for the explanation. So this is not just a simple search ... I'll go away and leave you and Michael and the other experts to talk about clever solutions.
--
Ian.

On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu <adrian.dimule...@gmail.com> wrote:
> Ian Lea wrote:
>>
>> Adrian - have you looked any further into why your original two term
>> query was too slow? My experience is that simple queries are usually
>> extremely fast.
>
> Let me first point out that it is not "too slow" in absolute terms; it is
> only slow for my particular need of counting co-occurrences between
> ideally all non-noise terms (I plan about 10 k x 10 k = 100 million
> calculations).
>
>> How large is the index?
>
> I indexed Wikipedia (the 8 GB XML dump you can download). The index size
> is 4.4 GB and it holds 39 million documents. The particularity is that I
> cut Wikipedia into paragraphs and treat each paragraph as a Document
> (rather than one page per Document, as usual), which makes for a lot of
> short documents. Each document has a stored id and a non-stored,
> analyzed body:
>
>   doc.add(new Field("id", id, Store.YES, Index.NO));
>   doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
>
>> How many occurrences of your first or second terms?
>
> I do have in my index some words that are usually qualified as "stop"
> words. My first two terms are "and" (13M hits) and "s" (4M hits). I use
> the SnowballAnalyzer in order to lemmatize words.
>
> My intuition is that the large number of short documents and the fact
> that I am interested in the "stop" words do not help performance.
>
> Thank you,
> Adrian.
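
For counts over terms this frequent, running a scoring search per pair does more work than is needed. One way to get just the intersection size is to walk the two posting lists directly; below is a minimal sketch against the Lucene 2.x TermDocs API that was current at the time. The "text" field name matches the indexing code above; the class and method names are made up for the sketch, and reader is assumed to be an already opened IndexReader.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class CoOccurrenceCounter {

    // Counts how many documents (here: paragraphs) contain both terms in the
    // "text" field, by leapfrogging over the two posting lists with skipTo().
    // No scoring and no hit collection -- only the intersection size is computed.
    public static int countCoOccurrences(IndexReader reader, String termA, String termB)
            throws IOException {
        TermDocs a = reader.termDocs(new Term("text", termA));
        TermDocs b = reader.termDocs(new Term("text", termB));
        int count = 0;
        try {
            if (!a.next() || !b.next()) {
                return 0;  // one of the terms does not occur at all
            }
            while (true) {
                if (a.doc() == b.doc()) {
                    count++;  // both terms occur in this paragraph
                    if (!a.next() || !b.next()) break;
                } else if (a.doc() < b.doc()) {
                    if (!a.skipTo(b.doc())) break;  // advance the lagging list
                } else {
                    if (!b.skipTo(a.doc())) break;
                }
            }
        } finally {
            a.close();
            b.close();
        }
        return count;
    }
}

As a usage note, the per-term counts quoted above can be read directly with reader.docFreq(new Term("text", "and")) rather than by running a query, and the leapfrog loop skips the scoring and hit collection that a two-term BooleanQuery over 13M- and 4M-document posting lists would otherwise spend much of its time on.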