Are you sure you have a performance problem with TermInfosReader.get(Term)? It looks to me like it scans sequentially only within a small buffer window (of size SegmentTermEnum.indexInterval) and that it uses binary search otherwise. See TermInfosReader.getIndexOffset(Term).
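
To illustrate the kind of lookup I mean, here is a minimal sketch, not Lucene's actual code. It assumes the terms are kept in sorted order and that every Nth term is sampled into an in-memory index. The class name SparseTermIndex, the find method, and the interval passed to the constructor are all made up for the example; only the name getIndexOffset is borrowed from TermInfosReader. The binary search picks the index slot, and the sequential scan never covers more than one interval's worth of terms.

```java
/**
 * Toy model of a sparse term index (hypothetical, for illustration only).
 * Every interval-th term of a sorted term list is sampled into indexTerms.
 */
public class SparseTermIndex {
    private final String[] terms;      // all terms, sorted; stands in for the on-disk term dictionary
    private final String[] indexTerms; // every interval-th term, held in memory
    private final int interval;

    public SparseTermIndex(String[] sortedTerms, int interval) {
        this.terms = sortedTerms.clone();
        this.interval = interval;
        int slots = (terms.length + interval - 1) / interval;
        this.indexTerms = new String[slots];
        for (int i = 0; i < slots; i++) {
            indexTerms[i] = terms[i * interval];
        }
    }

    /**
     * Binary search over the sampled terms. Returns the largest slot whose
     * term is <= target, or -1 if target sorts before every sampled term.
     */
    int getIndexOffset(String target) {
        int lo = 0;
        int hi = indexTerms.length - 1;
        while (hi >= lo) {
            int mid = (lo + hi) >>> 1;
            int cmp = target.compareTo(indexTerms[mid]);
            if (cmp < 0) {
                hi = mid - 1;
            } else if (cmp > 0) {
                lo = mid + 1;
            } else {
                return mid;
            }
        }
        return hi;
    }

    /** Returns the position of target in the sorted term list, or -1 if absent. */
    public int find(String target) {
        int slot = getIndexOffset(target);
        if (slot < 0) {
            return -1;
        }
        int start = slot * interval;
        int end = Math.min(start + interval, terms.length);
        // Sequential scan, bounded by one interval.
        for (int i = start; i < end; i++) {
            int cmp = terms[i].compareTo(target);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break; // passed the spot where target would be
            }
        }
        return -1;
    }
}
```

With a layout like this, the cost per lookup is a logarithmic number of comparisons against the sampled terms plus at most one interval of sequential comparisons, so it should not grow linearly with the number of terms in the index.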

Chuck

> -----Original Message-----
> From: John Wang [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 23, 2004 3:38 PM
> To: [EMAIL PROTECTED]
> Subject: URGENT: Help indexing large document set
>
> Hi:
>
> I am trying to index 1M documents, with batches of 500 documents.
>
> Each document has an unique text key, which is added as a
> Field.KeyWord(name,value).
>
> For each batch of 500, I need to make sure I am not adding a
> document with a key that is already in the current index.
>
> To do this, I am calling IndexSearcher.docFreq for each document and
> delete the document currently in the index with the same key:
>
> while (keyIter.hasNext()) {
>     String objectID = (String) keyIter.next();
>     term = new Term("key", objectID);
>     int count = localSearcher.docFreq(term);
>
>     if (count != 0) {
>         localReader.delete(term);
>     }
> }
>
> Then I proceed with adding the documents.
>
> This turns out to be extremely expensive, I looked into the code and I
> see in
> TermInfosReader.get(Term term) it is doing a linear look up for each
> term. So as the index grows, the above operation degrades at a linear
> rate. So for each commit, we are doing a docFreq for 500 documents.
>
> I also tried to create a BooleanQuery composed of 500 TermQueries and
> do 1 search for each batch, and the performance didn't get better. And
> if the batch size increases to say 50,000, creating a BooleanQuery
> composed of 50,000 TermQuery instances may introduce huge memory
> costs.
>
> Is there a better way to do this?
>
> Can TermInfosReader.get(Term term) be optimized to do a binary lookup
> instead of a linear walk? Of course that depends on whether the terms
> are stored in sorted order, are they?
>
> This is very urgent, thanks in advance for all your help.
>
> -John
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
