[ http://issues.apache.org/jira/browse/LUCENE-388?page=comments#action_12428527 ]
Yonik Seeley commented on LUCENE-388:
-------------------------------------
I was literally a minute away from committing my version when Doron submitted
his ;-)
Actually, I think I like Doron's "singleDocSegmentsCount" better.... it's
easier to understand at a glance.
I was testing the performance of mine... not as much of a speedup as I would
have liked...
5 to 6% better with maxBufferedDocs=1000 and a trivial single-field document.
You need to go to maxBufferedDocs=10000 to see a good speedup, and that's
probably not advisable for most real indices (and the maxBufferedDocs=1000 run
used much less memory and was slightly faster anyway).
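For anyone skimming: the counter idea boils down to something like the sketch
below. This is just an illustration, not Doron's actual patch or mine; apart
from singleDocSegmentsCount, minMergeDocs and maybeMergeSegments(), the names
are made up for the example.

// Sketch only: keep a running count of buffered single-doc segments so that
// addDocument() doesn't have to walk segmentInfos on every call.
private int singleDocSegmentsCount = 0;

public void addDocument(Document doc) throws IOException {
  // ... analyze the document and buffer it as a single-doc segment ...
  singleDocSegmentsCount++;
  // O(1) check instead of an O(number-of-segments) scan on every add:
  if (singleDocSegmentsCount >= minMergeDocs) {
    maybeMergeSegments();        // the expensive scan runs only when a merge may be due
    singleDocSegmentsCount = 0;
  }
}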
Here is the code I added to IndexWriter to test my version (add a
testInvariants() call after the add() call and after flushRamSegments() in
close(), then run "ant test"):
private synchronized void testInvariants() {
  // index segments should decrease in size
  int maxSegLevel = 0;
  for (int i = segmentInfos.size() - 1; i >= 0; i--) {
    SegmentInfo si = segmentInfos.info(i);
    int segLevel = (si.docCount) / minMergeDocs;
    if (segLevel < maxSegLevel) {
      throw new RuntimeException("Segment #" + i + " is too small. " + segInfo());
    }
    maxSegLevel = Math.max(maxSegLevel, segLevel);
  }

  // check if merges needed
  long targetMergeDocs = minMergeDocs;
  int minSegment = segmentInfos.size();
  while (targetMergeDocs <= maxMergeDocs && minSegment >= 0) {
    int mergeDocs = 0;
    while (--minSegment >= 0) {
      SegmentInfo si = segmentInfos.info(minSegment);
      if (si.docCount >= targetMergeDocs) break;
      mergeDocs += si.docCount;
    }
    if (mergeDocs >= targetMergeDocs) {
      throw new RuntimeException("Merge needed at level " + targetMergeDocs + " : " + segInfo());
    }
    targetMergeDocs *= mergeFactor; // increase target size
  }
}

private String segInfo() {
  StringBuffer sb = new StringBuffer(
      "minMergeDocs=" + minMergeDocs + " docsLeftBeforeMerge=" + docsLeftBeforeMerge + " segsizes:");
  for (int i = 0; i < segmentInfos.size(); i++) {
    sb.append(segmentInfos.info(i).docCount);
    sb.append(",");
  }
  return sb.toString();
}
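To be explicit about where the calls go, the hookup looks roughly like this
(bodies elided; flushRamSegments() is the existing private flush method, and
the exact signatures may differ slightly in your checkout):

public void addDocument(Document doc, Analyzer analyzer) throws IOException {
  // ... existing buffering / merge logic ...
  testInvariants();   // check segment invariants after every add
}

public synchronized void close() throws IOException {
  flushRamSegments();
  testInvariants();   // check again once the buffered docs are flushed
  // ... existing cleanup ...
}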
> [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
> --------------------------------------------------------------------
>
> Key: LUCENE-388
> URL: http://issues.apache.org/jira/browse/LUCENE-388
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Mac OS X 10.3
> Platform: Macintosh
> Reporter: Paul Smith
> Assigned To: Yonik Seeley
> Attachments: doron_IndexWriter.patch, IndexWriter.patch,
> log-compound.txt, log.optimized.deep.txt, log.optimized.txt, Lucene
> Performance Test - with & without hack.xls, lucene.34930.patch,
> yonik_indexwriter.diff, yonik_indexwriter.diff
>
>
> Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD.
> Analysis with the hprof utility during index creation with many documents
> shows that the CPU spends a large portion of its time in
> IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with
> other valuable CPU-intensive operations such as tokenization, etc.
> Using the following test snippet to retrieve some rows from the db and
> create an index:
> Analyzer a = new StandardAnalyzer();
> writer = new IndexWriter(indexDir, a, true);
> writer.setMergeFactor(1000);
> writer.setMaxBufferedDocs(10000);
> writer.setUseCompoundFile(false);
>
> connection = DriverManager.getConnection(
>     "jdbc:inetdae7:tower.aconex.com?database=<somedb>", "secret", "squirrel");
> String sql = "select userid, userfirstname, userlastname, email from userx";
> LOG.info("sql=" + sql);
> Statement statement = connection.createStatement();
> statement.setFetchSize(5000);
> LOG.info("Executing sql");
> ResultSet rs = statement.executeQuery(sql);
> LOG.info("ResultSet retrieved");
>
> int row = 0;
> LOG.info("Indexing users");
> long begin = System.currentTimeMillis();
> while (rs.next()) {
>     int userid = rs.getInt(1);
>     String firstname = rs.getString(2);
>     String lastname = rs.getString(3);
>     String email = rs.getString(4);
>     String fullName = firstname + " " + lastname;
>
>     Document doc = new Document();
>     doc.add(Field.Keyword("userid", userid + ""));
>     doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
>     doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
>     doc.add(Field.Text("name", fullName.toLowerCase()));
>     doc.add(Field.Keyword("email", email.toLowerCase()));
>     writer.addDocument(doc);
>
>     row++;
>     if ((row % 100) == 0) {
>         LOG.info(row + " indexed");
>     }
> }
> double end = System.currentTimeMillis();
> double diff = (end - begin) / 1000;
> double rate = row / diff;
> LOG.info("rate:" + rate);
> On my 1.5GHz PowerBook with 1.5GB RAM and a 5400 RPM drive, my CPU is maxed
> out, and I end up getting an indexing rate of between 490-515 documents/second
> across 10 successive runs.
> By applying a simple patch to IndexWriter (see attachment), which defers the
> calling of maybeMergeSegments() so that it is only called every 2000 adds (an
> arbitrary figure), I appear to get a new rate of between 945-970
> documents/second. Using Luke to look inside the indexes created by these two
> runs, there does not appear to be any difference: same number of Documents,
> same number of Terms.
> I'm not suggesting one should apply this patch; I'm just highlighting the
> difference in performance that this sort of change gives you.
> We are about to use Lucene to index 4 million construction document records,
> so speeding up the indexing process is in our best interest! :) If one
> considers the amount of CPU time spent in maybeMergeSegments over the initial
> index creation of 4 million documents, I think one could see how it would be
> ideal to try to speed this area up (or at least move the bottleneck to IO).
> I would appreciate anyone taking a moment to comment on this.
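For reference, the deferral hack described in the report above amounts to
something like the following. The counter name and interval constant are made
up here; the real change is in the attached patches.

private int addCount = 0;
private static final int MERGE_CHECK_INTERVAL = 2000; // arbitrary figure from the report

public void addDocument(Document doc) throws IOException {
  // ... buffer the document as a single-doc segment, as before ...
  if (++addCount % MERGE_CHECK_INTERVAL == 0) {
    maybeMergeSegments();   // only scan the segments every 2000 adds
  }
}

Note that deferring the check this way can let more than maxBufferedDocs
single-doc segments pile up between checks, which is one reason the
counter-based approaches discussed above trigger the check exactly when a merge
becomes due instead.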