Does keyIter return the keys in sorted order?  If it does, the lookups
walk the term index sequentially, which should reduce seeks, especially
if the keys are dense.
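If they aren't, sorting them first is cheap.  An untested sketch,
assuming your keys are Strings sitting in some collection I'll call
keys (substitute whatever you actually have):

    import java.util.*;   // ArrayList, Collections, Iterator, List

    // Sort the keys once so the delete pass walks the term index in
    // order.  String's natural order matches Lucene's term order.
    List sorted = new ArrayList(keys);
    Collections.sort(sorted);
    Iterator keyIter = sorted.iterator();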

Also, you should be able to call localReader.delete(term) instead of
iterating over the docs (of which I presume there is only one per key,
since keys are unique).  This won't improve performance, as
IndexReader.delete(Term) does exactly what your code does internally,
but it will be cleaner.
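The whole check-and-delete loop then collapses to something like this
(untested, against the 1.4 API):

    IndexReader localReader = null;
    try {
      localReader = IndexReader.open(path);
      while (keyIter.hasNext()) {
        String key = (String) keyIter.next();
        // delete(Term) removes every doc containing the term and
        // returns how many it deleted -- 0 or 1 if keys are unique
        int deleted = localReader.delete(new Term("key", key));
      }
    } finally {
      if (localReader != null) {
        localReader.close();
      }
    }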

A linear slowdown with the number of docs doesn't make sense, so
something else must be wrong.  I'm not sure what the default buffer
size is (it appears it used to be 128, but I think it is dynamic now).
You might find the slowdown stops after a certain point, especially if
you increase your batch size.
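While you're experimenting with batch size, the merge knobs on
IndexWriter are worth a try too.  The numbers below are arbitrary
starting points to tune against your hardware, not recommendations,
and "analyzer" stands in for whatever analyzer you index with:

    // false = append to the existing index rather than create a new one
    IndexWriter writer = new IndexWriter(path, analyzer, false);
    writer.minMergeDocs = 1000;  // docs buffered in RAM before a segment
                                 // is flushed (default is 10, I believe)
    writer.mergeFactor = 50;     // segments accumulated before a merge
                                 // (default 10); higher = fewer merges
    try {
      // ... writer.addDocument(doc) for each doc in the batch ...
    } finally {
      writer.close();
    }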

Chuck

  > -----Original Message-----
  > From: John Wang [mailto:[EMAIL PROTECTED]
  > Sent: Wednesday, November 24, 2004 12:21 PM
  > To: Lucene Users List
  > Subject: Re: URGENT: Help indexing large document set
  > 
  > Thanks Paul!
  > 
  > Using your suggestion, I have changed the update check code to use
  > only the indexReader:
  > 
  > try {
  >   localReader = IndexReader.open(path);
  > 
  >   while (keyIter.hasNext()) {
  >     key = (String) keyIter.next();
  >     term = new Term("key", key);
  >     TermDocs tDocs = localReader.termDocs(term);
  >     if (tDocs != null) {
  >       try {
  >         while (tDocs.next()) {
  >           localReader.delete(tDocs.doc());
  >         }
  >       } finally {
  >         tDocs.close();
  >       }
  >     }
  >   }
  > } finally {
  >   if (localReader != null) {
  >     localReader.close();
  >   }
  > }
  > 
  > 
  > Unfortunately it didn't seem to make any dramatic difference.
  > 
  > I also see the CPU is only 30-50% busy, so I am guessing it's
  > spending a lot of time in IO. Any way of making the CPU work harder?
  > 
  > Is a batch size of 500 too small for 1 million documents?
  > 
  > Currently I am seeing a linear speed degradation of 0.3 milliseconds
  > per document.
  > 
  > Thanks
  > 
  > -John
  > 
  > 
  > On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot
  > <[EMAIL PROTECTED]> wrote:
  > > On Wednesday 24 November 2004 00:37, John Wang wrote:
  > >
  > >
  > > > Hi:
  > > >
  > > >    I am trying to index 1M documents, with batches of 500 documents.
  > > >
  > > >    Each document has a unique text key, which is added as a
  > > > Field.Keyword(name, value).
  > > >
  > > >    For each batch of 500, I need to make sure I am not adding a
  > > > document with a key that is already in the current index.
  > > >
  > > >   To do this, I am calling IndexSearcher.docFreq for each document
  > > > and deleting the document currently in the index with the same key:
  > > >
  > > >        while (keyIter.hasNext()) {
  > > >          String objectID = (String) keyIter.next();
  > > >          term = new Term("key", objectID);
  > > >          int count = localSearcher.docFreq(term);
  > >
  > > To speed this up a bit, make sure that the iterator gives
  > > the terms in sorted order. I'd use an index reader instead
  > > of a searcher, but that will probably not make a difference.
  > >
  > > Adding the documents can be done with multiple threads.
  > > Last time I checked that, there was a moderate speed up
  > > using three threads instead of one on a single CPU machine.
  > > Tuning the values of minMergeDocs and maxMergeDocs
  > > may also help to increase performance of adding documents.
  > >
  > > Regards,
  > > Paul Elschot
  > >

