Ian Lea wrote:
> Adrian - have you looked any further into why your original two term
> query was too slow? My experience is that simple queries are usually
> extremely fast.
Let me first point out that it is not "too slow" in absolute terms; it is
only slow for my particular need of counting the number of co-occurrences
between, ideally, all non-noise terms (I plan about 10k x 10k = 100 million
calculations).
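To be concrete, each of those calculations is essentially a count of the
paragraphs matching a two-term conjunction. A simplified sketch of one pair,
assuming the Lucene 3.x API (the method name and the loop around it are
illustrative, not my exact code):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Number of paragraph-Documents whose "text" field contains both terms.
// Only the hit count matters, so a single top document is requested.
static int countPair(IndexSearcher searcher, String term1, String term2)
        throws IOException {
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("text", term1)), Occur.MUST);
    query.add(new TermQuery(new Term("text", term2)), Occur.MUST);
    return searcher.search(query, 1).totalHits;
}

The outer loop simply calls countPair for every pair of terms from the ~10k
candidate list.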
> How large is the index?
I indexed Wikipedia (the 8 GB XML dump you can download). The index size
is 4.4 GB and it contains 39 million documents. The particularity is that I
cut Wikipedia into paragraphs and treat each paragraph as a Document (not
one page per Document, as usual), which makes for a lot of short documents.
Each document has a stored id and a non-stored, analyzed body:
doc.add(new Field("id", id, Store.YES, Index.NO));
doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
> How many occurrences of your first or second terms?
I do have in my index some words that are usually qualified as "stop" words.
My first two terms are "and" (13M hits) and "s" (4M hits). I use the
SnowballAnalyzer in order to lemmatize words.
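Since those stop words have to end up in the index, the analyzer is built
without a stop-word set, roughly like this (a sketch assuming the Lucene 3.x
contrib SnowballAnalyzer; the Version constant is an assumption):

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

// No stop-word set is passed, so "and", "s", etc. are stemmed and
// indexed like any other token.
SnowballAnalyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");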
My intuition is that the large number of short documents, and the fact that
I am interested in the "stop" words, do not help performance.
Thank you,
Adrian.