OK - thanks for the explanation. So this is not just a simple search ... I'll go away and leave you and Michael and the other experts to talk about clever solutions.
--
Ian.

On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu <adrian.dimule...@gmail.com> wrote:
> Ian Lea wrote:
>>
>> Adrian - have you looked any further into why your original two term
>> query was too slow? My experience is that simple queries are usually
>> extremely fast.
>
> Let me first point out that it is not "too slow" in absolute terms; it is
> only slow for my particular need of counting co-occurrences between
> ideally all non-noise terms (I plan about 10 k x 10 k = 100 million
> calculations).
>
>> How large is the index?
>
> I indexed Wikipedia (the 8 GB XML dump you can download). The index size
> is 4.4 GB and it holds 39 million documents. The particularity is that I
> cut Wikipedia into paragraphs and treat each paragraph as a Document
> (rather than one page per Document, as usual), which makes for a lot of
> short documents. Each document has a stored id and a non-stored,
> analyzed body:
>
>   doc.add(new Field("id", id, Store.YES, Index.NO));
>   doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
>
>> How many occurrences of your first or second terms?
>
> I do have in my index some words that are usually qualified as "stop"
> words. My first two terms are "and" (13M hits) and "s" (4M hits). I use
> the SnowballAnalyzer in order to lemmatize words.
>
> My intuition is that the large number of short documents and the fact
> that I am interested in the "stop" words do not help performance.
>
> Thank you,
> Adrian.
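
For counts over terms this frequent, running a scoring search per pair does more work than is needed. One way to get just the intersection size is to walk the two posting lists directly; below is a minimal sketch against the Lucene 2.x TermDocs API that was current at the time. The "text" field name matches the indexing code above; the class and method names are made up for the sketch, and reader is assumed to be an already opened IndexReader.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class CoOccurrenceCounter {

    // Counts how many documents (here: paragraphs) contain both terms in the
    // "text" field, by leapfrogging over the two posting lists with skipTo().
    // No scoring and no hit collection -- only the intersection size is computed.
    public static int countCoOccurrences(IndexReader reader, String termA, String termB)
            throws IOException {
        TermDocs a = reader.termDocs(new Term("text", termA));
        TermDocs b = reader.termDocs(new Term("text", termB));
        int count = 0;
        try {
            if (!a.next() || !b.next()) {
                return 0;  // one of the terms does not occur at all
            }
            while (true) {
                if (a.doc() == b.doc()) {
                    count++;  // both terms occur in this paragraph
                    if (!a.next() || !b.next()) break;
                } else if (a.doc() < b.doc()) {
                    if (!a.skipTo(b.doc())) break;  // advance the lagging list
                } else {
                    if (!b.skipTo(a.doc())) break;
                }
            }
        } finally {
            a.close();
            b.close();
        }
        return count;
    }
}

As a usage note, the per-term counts quoted above can be read directly with reader.docFreq(new Term("text", "and")) rather than by running a query, and the leapfrog loop skips the scoring and hit collection that a two-term BooleanQuery over 13M- and 4M-document posting lists would otherwise spend much of its time on.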