Hi,
I am trying to find a way to handle wildcard queries in Lucene without running out
of memory, and I have been having some problems with it.
I have modified parts of Lucene's search code to keep only about 1000 terms in
memory and write the rest of the terms to a file (this is done in the getQuery()
method of MultiTermQuery.java, PrefixQuery.java, etc.).
Then, when the scorer objects are created and scores are collected for each clause in
the score() method of BooleanScorer.java, I first process all the clauses that are in
memory and then continue reading from the file that I created earlier. For each term
read from the file, I create a TermQuery, get the scorer object from that TermQuery,
and collect the score for it.
Finally, the bucketTable does collectHits() on everything.
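To make the spill-and-read-back idea above concrete, here is a minimal, self-contained
sketch. It is not the actual patch and uses no Lucene classes: TermSpill and all of its
methods are hypothetical names, terms are plain strings, and it assumes that a term
never contains a line break (terms are written one per line).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Lucene): caps the number of terms held in memory. */
public class TermSpill {
    private final int maxInMemory;
    private final List<String> inMemory = new ArrayList<>();
    private Path spillFile;             // created lazily on the first overflow
    private BufferedWriter spillWriter; // stays null while everything fits in memory
    private int spilled;

    public TermSpill(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    /** Keeps the term in memory while there is room, otherwise appends it to the file. */
    public void add(String term) {
        if (inMemory.size() < maxInMemory) {
            inMemory.add(term);
            return;
        }
        try {
            if (spillWriter == null) {
                spillFile = Files.createTempFile("terms", ".spill");
                spillWriter = Files.newBufferedWriter(spillFile, StandardCharsets.UTF_8);
            }
            spillWriter.write(term); // assumption: one term per line, no embedded newlines
            spillWriter.newLine();
            spilled++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Visits the in-memory terms first, then streams the spilled terms from the file. */
    public void forEachTerm(Consumer<String> visitor) {
        inMemory.forEach(visitor);
        if (spillWriter == null) {
            return; // nothing was spilled
        }
        try {
            spillWriter.flush();
            try (BufferedReader reader =
                     Files.newBufferedReader(spillFile, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    visitor.accept(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int inMemoryCount() { return inMemory.size(); }

    public int spilledCount() { return spilled; }

    /** Closes the writer and deletes the temporary file, if one was created. */
    public void close() {
        if (spillWriter == null) {
            return;
        }
        try {
            spillWriter.close();
            Files.deleteIfExists(spillFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TermSpill spill = new TermSpill(1000);
        for (int i = 0; i < 1972; i++) {
            spill.add("term" + i);
        }
        int[] visited = {0};
        spill.forEachTerm(t -> visited[0]++);
        System.out.println(spill.inMemoryCount() + " in memory, "
            + spill.spilledCount() + " in file, " + visited[0] + " visited");
        spill.close();
    }
}
```

In the real code, the visitor would be the place where a TermQuery and its scorer are
created for each term; that part is left out here because it depends on the Lucene
version in use.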
I tested my changes on small indexes, with about 2 terms in memory and 2 or 3 terms
in the file, and it worked fine.
However, when I tried this on bigger indexes (> 1 million docs), with 1000 terms in
memory and 972 in the file, I got into an infinite loop in
bucketTable.collectHits(). I printed out the doc in each bucket and noticed that,
about halfway through the bucket list, the same 4 - 5 docs or so started repeating
for the rest of the list, and there was no null at the end of the list to terminate it.
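A list that repeats the same few entries and never reaches null is what a cycle in a
singly linked list looks like. As a rough sketch of how that can be checked without
looping forever, the code below uses Floyd's tortoise-and-hare check. The Bucket class
is a hypothetical stand-in with only a doc and a next pointer, not Lucene's own class,
and the push() helper is only there to build a test list. It also shows one generic way
such a cycle can arise: pushing a node that is already linked onto the front again.

```java
/** Hypothetical stand-in for a bucket list; not Lucene's BucketTable. */
public class BucketListCheck {

    /** Minimal node: a doc id and a pointer to the next bucket. */
    public static final class Bucket {
        public final int doc;
        public Bucket next;

        public Bucket(int doc) {
            this.doc = doc;
        }
    }

    /** Pushes a bucket onto the front of the list and returns the new head. */
    public static Bucket push(Bucket first, Bucket bucket) {
        bucket.next = first;
        return bucket;
    }

    /**
     * Floyd's cycle check: one pointer moves one step, the other two steps.
     * They can only meet again if following next never reaches null.
     */
    public static boolean hasCycle(Bucket first) {
        Bucket slow = first;
        Bucket fast = first;
        while (fast != null && fast.next != null) {
            slow = slow.next;
            fast = fast.next.next;
            if (slow == fast) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Bucket a = new Bucket(1);
        Bucket b = new Bucket(2);
        Bucket c = new Bucket(3);

        // c -> b -> a -> null
        Bucket first = push(push(push(null, a), b), c);
        System.out.println(hasCycle(first)); // prints "false"

        // Re-pushing a node that is already linked: a -> c -> b -> a -> ...
        first = push(first, a);
        System.out.println(hasCycle(first)); // prints "true"
    }
}
```

Running a check like this on the list head just before the collection loop would at
least confirm whether the list is cyclic or merely much longer than expected.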
I have looked everywhere and even tried increasing the bucket table size to the
sum of the number of terms in memory and the number of terms in the file, but that
still did not work.
I would really appreciate any suggestions/ideas/help on this.
Thanks.
Javier