Hi:
I am trying to index 1M documents in batches of 500. Each document has a
unique text key, which is added as a Field.Keyword(name, value).
For each batch of 500, I need to make sure I am not adding a document
whose key is already in the current index. To do this, I call
IndexSearcher.docFreq for each document and delete any document currently
in the index with the same key:
while (keyIter.hasNext()) {
    String objectID = (String) keyIter.next();
    Term term = new Term("key", objectID);
    // Check whether a document with this key is already in the index.
    int count = localSearcher.docFreq(term);
    if (count != 0) {
        // Remove the existing document so the new version can be added.
        localReader.delete(term);
    }
}
Then I proceed with adding the documents.
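For completeness, the add step looks roughly like this (just a sketch: the
writer setup, the analyzer choice, and the record accessors like
rec.getObjectID()/rec.getText() are simplified placeholders, not my exact code):

    IndexWriter localWriter = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    while (docIter.hasNext()) {
        MyRecord rec = (MyRecord) docIter.next();   // placeholder record type
        Document doc = new Document();
        // The unique key, stored and indexed untokenized so it can be matched exactly.
        doc.add(Field.Keyword("key", rec.getObjectID()));
        // The searchable body text.
        doc.add(Field.Text("contents", rec.getText()));
        localWriter.addDocument(doc);
    }
    localWriter.close();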
This turns out to be extremely expensive. I looked into the code and I see
that TermInfosReader.get(Term term) does a linear lookup for each term, so
as the index grows this operation degrades linearly. And for every commit
we are doing that docFreq call for 500 documents.
I also tried building a BooleanQuery composed of 500 TermQuery instances
and doing a single search per batch, but performance did not get any
better. And if the batch size increases to, say, 50,000, creating a
BooleanQuery composed of 50,000 TermQuery instances may introduce a huge
memory cost.
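For reference, the batched query looked roughly like this (again a sketch;
I am assuming the older BooleanQuery.add(query, required, prohibited)
signature here, and keyIter, localSearcher and localReader are the same
objects as in the loop above):

    // Build one disjunction over all keys in the batch and run a single search.
    BooleanQuery batchQuery = new BooleanQuery();
    while (keyIter.hasNext()) {
        String objectID = (String) keyIter.next();
        // required = false, prohibited = false -> an optional (OR) clause.
        batchQuery.add(new TermQuery(new Term("key", objectID)), false, false);
    }
    Hits hits = localSearcher.search(batchQuery);
    for (int i = 0; i < hits.length(); i++) {
        // Delete each duplicate that is already in the index.
        localReader.delete(new Term("key", hits.doc(i).get("key")));
    }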
Is there a better way to do this?
Can TermInfosReader.get(Term term) be optimized to do a binary search
instead of a linear walk? Of course, that depends on whether the terms are
stored in sorted order. Are they?
This is very urgent; thanks in advance for all your help.
-John