Ian Lea wrote:
Adrian - have you looked any further into why your original two term
query was too slow?  My experience is that simple queries are usually
extremely fast.
Let me first point out that it is not "too slow" in absolute terms; it is only too slow for my particular need of counting co-occurrences between, ideally, all non-noise terms (I plan about 10k x 10k = 100 million calculations).
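
For reference, a per-pair count like this boils down to a conjunction of two TermQuery clauses, reading only the total hit count. A rough sketch against the Lucene 2.x/3.x API (searcher setup elided, field name "text" as in my indexing code below):

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    // Count the paragraphs (Documents) whose "text" field contains both terms.
    int countCooccurrences(IndexSearcher searcher, String term1, String term2)
            throws IOException {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("text", term1)), Occur.MUST);
        query.add(new TermQuery(new Term("text", term2)), Occur.MUST);
        // Only the count is needed, so ask for a single hit and read totalHits.
        return searcher.search(query, 1).totalHits;
    }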
How large is the index?
I indexed Wikipedia (the 8GB XML dump you can download). The index size is 4.4 GB and I have 39 million documents. The particularity is that I split Wikipedia into paragraphs and treat each paragraph as a Document (not one page per Document as usual), which makes for a lot of short documents. Each document has a stored id and a non-stored, analyzed body:

           doc.add(new Field("id", id, Store.YES, Index.NO));
           doc.add(new Field("text", p, Store.NO, Index.ANALYZED));

How many occurrences of your first or second
terms?
I do have in my index some words that would usually be treated as "stop" words. My first two terms are "and" (13M hits) and "s" (4M hits). I use the SnowballAnalyzer to stem words.
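
Those hit counts correspond to document frequencies, which can be read straight from the term dictionary. A rough sketch, again against the Lucene 2.x/3.x API and assuming an already-open IndexReader named reader:

    // Document frequency of a single term, i.e. the number of paragraphs containing it.
    int dfAnd = reader.docFreq(new Term("text", "and"));   // ~13M in my index
    int dfS   = reader.docFreq(new Term("text", "s"));     // ~4M in my index

    // Enumerate all terms of the "text" field with their document frequencies,
    // e.g. to select the ~10k candidate terms.
    TermEnum terms = reader.terms(new Term("text", ""));
    try {
        do {
            Term t = terms.term();
            if (t == null || !"text".equals(t.field())) break;
            System.out.println(t.text() + "\t" + terms.docFreq());
        } while (terms.next());
    } finally {
        terms.close();
    }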

My intuition is that the large number of short documents and the fact that I am interested in the "stop" words do not help performance.

Thank you,
Adrian.
