    [ http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12428478 ]

Ning Li commented on LUCENE-528:
--------------------------------

In an email thread titled "LUCENE-528 and 565", I described a weakness of the proposed solution:

"I'm totally for a version of addIndexes() where optimize() is not always called. However, with the one proposed in the patch, we could end up with an index where: segment 0 has 1000 docs, 1 has 2000, 2 has 4000, 3 has 8000, etc. while Lucene desires the reverse. Or we could have a sandwich index where: segment 0 has 4000 docs, 1 has 100, 2 has 100, 3 has 4000. While neither of these will occur if you use addIndexesNoOpt() carefully, there should be a more robust merge policy."

Here is an alternative solution. Each merge takes at least half of the remaining documents, so the docCount of segment i ends up at least as large as the combined docCount of all segments after it (up to integer rounding). If we are willing to make it a bit more complicated, we can take the merge factor into consideration.

  public synchronized void addIndexesNoOpt(Directory[] dirs) throws IOException {
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos();     // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j));    // add each info
      }
    }

    int start = 0;
    int docCountFromStart = docCount();

    while (start < segmentInfos.size()) {
      int end;
      int docCountToMerge = 0;

      if (docCountFromStart <= minMergeDocs) {
        // if the total docCount of the remaining segments
        // is lte minMergeDocs, merge all of them
        end = segmentInfos.size() - 1;
        docCountToMerge = docCountFromStart;
      } else {
        // otherwise, merge some segments so that the docCount
        // of these segments is at least half of the remaining
        for (end = start; end < segmentInfos.size(); end++) {
          docCountToMerge += segmentInfos.info(end).docCount;
          if (docCountToMerge >= docCountFromStart / 2) {
            break;
          }
        }
      }

      // segments start..end (inclusive) become one segment at position start
      mergeSegments(start, end + 1);
      start++;
      docCountFromStart -= docCountToMerge;
    }
  }

> Optimization for IndexWriter.addIndexes()
> -----------------------------------------
>
>                 Key: LUCENE-528
>                 URL: http://issues.apache.org/jira/browse/LUCENE-528
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Tamm
>         Assigned To: Otis Gospodnetic
>            Priority: Minor
>         Attachments: AddIndexes.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to
> optimize the index both before and after adding the segments. When you have
> a very large index, to which you are adding batches of small updates, these
> calls to optimize make using addIndexes() impossible. It makes parallel
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on
> the newly added documents. It will try to avoid calling mergeSegments until
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works
> correctly if people are interested. I gave it a different name because it
> has very different performance characteristics which can make querying take
> longer.
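
As a rough illustration of the merge loop in the comment above, the sketch below replays it on plain document counts instead of real segments. It is not Lucene code: the class name AddIndexesMergeSketch and the method simulate are hypothetical, and a "merge" is assumed to replace a run of segments with a single segment whose docCount is their sum (deleted documents are ignored).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone simulation; not part of Lucene.
final class AddIndexesMergeSketch {

    // Replays the merge loop on a list of per-segment doc counts and
    // returns the doc counts of the segments that remain afterwards.
    static List<Integer> simulate(List<Integer> docCounts, int minMergeDocs) {
        List<Integer> segments = new ArrayList<>(docCounts);
        int docCountFromStart = 0;
        for (int count : segments) {
            docCountFromStart += count;
        }

        int start = 0;
        while (start < segments.size()) {
            int end;
            int docCountToMerge = 0;
            if (docCountFromStart <= minMergeDocs) {
                // Few documents remain: merge everything from start onward.
                end = segments.size() - 1;
                docCountToMerge = docCountFromStart;
            } else {
                // Take segments until they hold at least half of the remainder.
                for (end = start; end < segments.size(); end++) {
                    docCountToMerge += segments.get(end);
                    if (docCountToMerge >= docCountFromStart / 2) {
                        break;
                    }
                }
            }
            // Assumed merge semantics: segments start..end collapse into one.
            segments.subList(start, end + 1).clear();
            segments.add(start, docCountToMerge);
            start++;
            docCountFromStart -= docCountToMerge;
        }
        return segments;
    }

    public static void main(String[] args) {
        // The two problem layouts quoted in the comment, with minMergeDocs = 10.
        System.out.println(simulate(Arrays.asList(1000, 2000, 4000, 8000), 10));
        System.out.println(simulate(Arrays.asList(4000, 100, 100, 4000), 10));
    }
}
```

Under these assumptions the ascending index collapses to a single segment, [15000], and the sandwich index becomes [4100, 4100]. In both cases every segment holds at least half of the documents from its position onward, which is the property the loop checks directly.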