Ian Lea wrote:
> Adrian - have you looked any further into why your original two term
> query was too slow? My experience is that simple queries are usually
> extremely fast.
Let me first point out that it is not "too slow" in absolute terms; it is
only slow for my particular need of counting the number of co-occurrences
between, ideally, all non-noise terms (I plan about 10k x 10k = 100 million
calculations).
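To be concrete, each of those calculations is essentially a count of the
paragraphs matching a two-term conjunction. A simplified sketch of one pair,
assuming the Lucene 3.x API (the method name and the loop around it are
illustrative, not my exact code):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Number of paragraph-Documents whose "text" field contains both terms.
// Only the hit count matters, so a single top document is requested.
static int countPair(IndexSearcher searcher, String term1, String term2)
        throws IOException {
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("text", term1)), Occur.MUST);
    query.add(new TermQuery(new Term("text", term2)), Occur.MUST);
    return searcher.search(query, 1).totalHits;
}

The outer loop simply calls countPair for every pair of terms from the ~10k
candidate list.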
> How large is the index?
I indexed Wikipedia (the 8 GB XML dump you can download). The index size
is 4.4 GB and it contains 39 million documents. The particularity is that I
cut Wikipedia into paragraphs and treat each paragraph as a Document (not
one page per Document, as usual), which makes for a lot of short documents.
Each document has a stored id and a non-stored, analyzed body:
doc.add(new Field("id", id, Store.YES, Index.NO));
doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
> How many occurrences of your first or second terms?
I do have in my index some words that are usually qualified as "stop" words.
My first two terms are "and" (13M hits) and "s" (4M hits). I use the
SnowballAnalyzer in order to lemmatize words.
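Since those stop words have to end up in the index, the analyzer is built
without a stop-word set, roughly like this (a sketch assuming the Lucene 3.x
contrib SnowballAnalyzer; the Version constant is an assumption):

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

// No stop-word set is passed, so "and", "s", etc. are stemmed and
// indexed like any other token.
SnowballAnalyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");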
My intuition is that the large number of short documents, and the fact that
I am interested in the "stop" words, do not help performance.
Thank you,
Adrian.