[jira] Commented: (LUCENE-528) Optimization for IndexWriter.addIndexes()

Ning Li (JIRA) Wed, 16 Aug 2006 12:31:20 -0700

    [ http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12428478 ]
Ning Li commented on LUCENE-528:
--------------------------------


In an email thread titled "LUCENE-528 and 565", I described a weakness of the 
proposed solution:

"I'm totally for a version of addIndexes() where optimize() is not always 
called. However, with the one proposed in the patch, we could end up with an 
index where: segment 0 has 1000 docs, 1 has 2000, 2 has 4000, 3 has 8000, etc. 
while Lucene desires the reverse. Or we could have a sandwich index where: 
segment 0 has 4000 docs, 1 has 100, 2 has 100, 3 has 4000. While neither of 
these will occur if you use addIndexesNoOpt() carefully, there should be a more 
robust merge policy."

Here is an alternative solution which merges segments so that the docCount of 
segment i is at least twice as big as the docCount of segment i+1. If we are 
willing to make it a bit more complicated, we can take the merge factor into 
consideration.


  public synchronized void addIndexesNoOpt(Directory[] dirs) throws IOException {
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos(); // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j)); // add each info
      }
    }

    int start = 0;
    int docCountFromStart = docCount();

    while (start < segmentInfos.size()) {
      int end;
      int docCountToMerge = 0;

      if (docCountFromStart <= minMergeDocs) {
        // if the total docCount of the remaining segments
        // is at most minMergeDocs, merge all of them
        end = segmentInfos.size() - 1;
        docCountToMerge = docCountFromStart;
      }
      else {
        // otherwise, merge some segments so that the docCount
        // of these segments is at least half of the remaining
        for (end = start; end < segmentInfos.size(); end++) {
          docCountToMerge += segmentInfos.info(end).docCount;
          if (docCountToMerge >= docCountFromStart / 2) {
            break;
          }
        }
      }
      
      mergeSegments(start, end + 1);
      start++;
      docCountFromStart -= docCountToMerge;
    }
  }
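
To make the effect of the loop above easier to follow, here is a minimal 
stand-alone sketch that replays the same logic on plain doc counts instead of 
real segments. The class MergeSketch and the method simulate() are hypothetical 
names used only for this illustration and are not part of Lucene; the call to 
mergeSegments(start, end + 1) is replaced by summing list entries, on the 
assumption that it merges segments start..end into a single segment at position 
start.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class MergeSketch {

    /**
     * Replays the merge loop on a list of per-segment doc counts and
     * returns the doc counts of the segments that remain afterwards.
     */
    static List<Integer> simulate(List<Integer> docCounts, int minMergeDocs) {
        List<Integer> segments = new ArrayList<>(docCounts);

        // total number of docs in segments[start..], initially all of them
        int docCountFromStart = 0;
        for (int count : segments) {
            docCountFromStart += count;
        }

        int start = 0;
        while (start < segments.size()) {
            int end;
            int docCountToMerge = 0;

            if (docCountFromStart <= minMergeDocs) {
                // small tail: merge everything that is left
                end = segments.size() - 1;
                docCountToMerge = docCountFromStart;
            } else {
                // take segments until they hold at least half of the rest
                for (end = start; end < segments.size(); end++) {
                    docCountToMerge += segments.get(end);
                    if (docCountToMerge >= docCountFromStart / 2) {
                        break;
                    }
                }
            }

            // stand-in for mergeSegments(start, end + 1):
            // replace segments[start..end] by one segment with their total
            segments.subList(start, end + 1).clear();
            segments.add(start, docCountToMerge);

            start++;
            docCountFromStart -= docCountToMerge;
        }
        return segments;
    }
}
```

With minMergeDocs set to 10, the sandwich index from the quoted email (4000, 
100, 100, 4000) comes out as two segments of 4100 docs each, the ascending index 
(1000, 2000, 4000, 8000) collapses into one segment of 15000 docs, and an index 
that is already in descending order (8000, 4000, 2000, 1000) keeps its layout. 
In every case each resulting segment holds at least about half of the docs 
counted from its position onward.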


> Optimization for IndexWriter.addIndexes()
> -----------------------------------------
>
>                 Key: LUCENE-528
>                 URL: http://issues.apache.org/jira/browse/LUCENE-528
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Tamm
>         Assigned To: Otis Gospodnetic
>            Priority: Minor
>         Attachments: AddIndexes.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to 
> optimize the index both before and after adding the segments.  When you have 
> a very large index, to which you are adding batches of small updates, these 
> calls to optimize make using addIndexes() impossible.  It makes parallel 
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on 
> the newly added documents.  It will try to avoid calling mergeSegments until 
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works 
> correctly if people are interested.  I gave it a different name because it 
> has very different performance characteristics which can make querying take 
> longer.
