Thanks, Chuck! I missed the call to getIndexOffset. I am profiling again to pinpoint where the performance problem is.
-John

On Tue, 23 Nov 2004 16:13:22 -0800, Chuck Williams <[EMAIL PROTECTED]> wrote:
> Are you sure you have a performance problem with
> TermInfosReader.get(Term)? It looks to me like it scans sequentially
> only within a small buffer window (of size
> SegmentTermEnum.indexInterval) and otherwise uses binary search.
> See TermInfosReader.getIndexOffset(Term).
>
> Chuck
>
> > -----Original Message-----
> > From: John Wang [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, November 23, 2004 3:38 PM
> > To: [EMAIL PROTECTED]
> > Subject: URGENT: Help indexing large document set
> >
> > Hi:
> >
> > I am trying to index 1M documents in batches of 500.
> >
> > Each document has a unique text key, which is added as
> > Field.Keyword(name, value).
> >
> > For each batch of 500, I need to make sure I am not adding a
> > document with a key that is already in the current index.
> >
> > To do this, I call IndexSearcher.docFreq for each document and
> > delete any document already in the index with the same key:
> >
> > while (keyIter.hasNext()) {
> >     String objectID = (String) keyIter.next();
> >     Term term = new Term("key", objectID);
> >     int count = localSearcher.docFreq(term);
> >
> >     if (count != 0) {
> >         localReader.delete(term);
> >     }
> > }
> >
> > Then I proceed with adding the documents.
> >
> > This turns out to be extremely expensive. I looked into the code, and
> > I see that TermInfosReader.get(Term term) does a linear lookup for
> > each term, so the operation above degrades linearly as the index
> > grows. And for each commit, we are doing a docFreq for 500 documents.
> >
> > I also tried creating a BooleanQuery composed of 500 TermQueries and
> > doing one search per batch, but the performance didn't get better.
> > And if the batch size increases to, say, 50,000, creating a
> > BooleanQuery composed of 50,000 TermQuery instances may introduce
> > huge memory costs.
> >
> > Is there a better way to do this?
> >
> > Can TermInfosReader.get(Term term) be optimized to do a binary lookup
> > instead of a linear walk? Of course, that depends on whether the
> > terms are stored in sorted order. Are they?
> >
> > This is very urgent; thanks in advance for all your help.
> >
> > -John
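For reference, one way to avoid the per-key docFreq calls entirely is a minimal sketch against the Lucene 1.4 API; the BatchDeduper class name and the sorted-key iteration are assumptions of this sketch, not anything from the thread. IndexReader.delete(Term) already returns the number of documents it removed, so the separate docFreq check costs a second term lookup per key for no benefit. Feeding the keys in sorted order may also help, since it keeps lookups moving forward through Lucene's sorted term dictionary rather than jumping around it.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.SortedSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class BatchDeduper {

        /**
         * Deletes any documents whose "key" field matches one of the
         * batch keys. IndexReader.delete(Term) returns the number of
         * documents it removed (0 if the term is absent), so no
         * docFreq pre-check is needed.
         */
        public static int deleteExisting(IndexReader reader, SortedSet keys)
                throws IOException {
            int deleted = 0;
            // Sorted iteration walks the term dictionary in order,
            // one delete per key, with no preliminary docFreq lookup.
            for (Iterator it = keys.iterator(); it.hasNext();) {
                String objectID = (String) it.next();
                deleted += reader.delete(new Term("key", objectID));
            }
            return deleted;
        }
    }

The batch of 500 keys would be collected into a TreeSet before the commit, and the documents added afterwards exactly as before.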

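To make Chuck's point concrete, here is a toy model of the lookup strategy he describes; this is an illustration, not Lucene's actual code. Every interval-th term of the sorted dictionary is held in memory, that sparse index is binary-searched, and then at most interval entries are scanned linearly, so a lookup costs O(log n + interval) comparisons rather than a full linear walk.

    // Toy illustration of a sparse in-memory index over a sorted term
    // dictionary: binary-search the index, then scan one small block.
    public class SparseTermIndex {
        private final String[] allTerms; // full sorted term dictionary
        private final String[] index;    // every interval-th term
        private final int interval;

        public SparseTermIndex(String[] sortedTerms, int interval) {
            this.allTerms = sortedTerms;
            this.interval = interval;
            int n = (sortedTerms.length + interval - 1) / interval;
            this.index = new String[n];
            for (int i = 0; i < n; i++) {
                index[i] = sortedTerms[i * interval];
            }
        }

        /** Returns the position of term, or -1 if absent. */
        public int lookup(String term) {
            int lo = 0, hi = index.length - 1;
            while (lo <= hi) {                  // binary search the sparse index
                int mid = (lo + hi) >>> 1;
                int cmp = index[mid].compareTo(term);
                if (cmp < 0) lo = mid + 1;
                else if (cmp > 0) hi = mid - 1;
                else return mid * interval;     // exact hit on an index entry
            }
            // hi now points at the largest index entry below term;
            // only that one block of at most `interval` terms can hold it.
            int start = Math.max(0, hi) * interval;
            int end = Math.min(allTerms.length, start + interval);
            for (int i = start; i < end; i++) { // short linear scan
                if (allTerms[i].equals(term)) return i;
            }
            return -1;
        }
    }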