[ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1120:
---------------------------------------

    Attachment: LUCENE-1120.patch

Attached patch. All tests pass. (Note that the TestBackwardsCompatibility test will fail if you apply the patch because the new *.zip files I added aren't in the patch).

I think we should commit this for 2.3? It's a sizable gain in merging performance.

> Use bulk-byte-copy when merging term vectors
> --------------------------------------------
>
>                 Key: LUCENE-1120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1120.patch
>
>
> Indexing all of Wikipedia, with term vectors on, under the YourKit
> profiler, shows that 26% of the time (!!) was spent merging the
> vectors. This was without offsets & positions, which would make
> matters even worse.
>
> Depressingly, merging, even with ConcurrentMergeScheduler, cannot in
> fact keep up with the flushing of new segments in this test, and this
> is on a strong IO system (Mac Pro with 4 drive RAID 0 array, 4 CPU
> cores).
>
> So, just like Robert's idea to merge stored fields with bulk copying
> whenever the field name->number mapping is "congruent" (LUCENE-1043),
> we can do the same with term vectors.
>
> It's a little trickier because the term vectors format doesn't quite
> make it easy to bulk-copy because it doesn't directly encode the
> offset into the tvf file.
>
> I worked out a patch that changes the tvx format slightly, by storing
> the absolute position in the tvf file for the start of each document
> into the tvx file, just like it does for tvd now. This adds an extra
> 8 bytes (long) in the tvx file, per document.
>
> Then, I removed a vLong (the first "position" stored inside the tvd
> file), which makes tvd contents fully position independent (so you can
> just copy the bytes).
>
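To illustrate why absolute per-document start positions make a byte-level copy possible, here is a minimal sketch. It is not the code from the patch: the class BulkCopySketch, the method bulkCopy, and the in-memory arrays standing in for the index and data files are all hypothetical, and the real file layout is more involved. The point is only that once each document's start offset is known and the document bytes themselves contain no absolute positions, a run of documents can be copied in a single write, with the new offsets obtained by shifting the old ones by a constant.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch; these are not Lucene's classes or its on-disk layout.
final class BulkCopySketch {

    /**
     * Copies the bytes of documents [firstDoc, firstDoc + numDocs) from src
     * into out with one contiguous write and returns their new start offsets.
     *
     * starts[i] is the absolute offset of document i in src. The array is
     * assumed to carry one extra trailing entry equal to the total data
     * length, so the end of the last document is known.
     */
    static long[] bulkCopy(byte[] src, long[] starts, int firstDoc, int numDocs,
                           ByteArrayOutputStream out) {
        long base = out.size();                     // where the run lands in the output
        long runStart = starts[firstDoc];
        long runEnd = starts[firstDoc + numDocs];

        // One write for the whole run; no per-document decoding is needed
        // because the copied bytes are position independent.
        out.write(src, (int) runStart, (int) (runEnd - runStart));

        // Each new absolute offset is the old one shifted by a constant.
        long[] newStarts = new long[numDocs];
        for (int i = 0; i < numDocs; i++) {
            newStarts[i] = base + (starts[firstDoc + i] - runStart);
        }
        return newStarts;
    }
}
```

A merger built along these lines would append the returned offsets to the new index file and fall back to decoding documents one at a time whenever the bulk path does not apply.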
> This adds up to 7 bytes per document (less for larger indices) that
> have term vectors enabled, but I think this small increase in index
> size is acceptable for the gains in indexing performance?
>
> With this change, the time spent merging term vectors dropped from 26%
> to 3%. Of course, this only applies if your documents are "regular".
> I think in the future we could have Lucene try hard to assign the same
> field number for a given field name, if it had been seen before in the
> index...
>
> Merging terms now dominates the merge cost (~20% of overall time
> building the Wikipedia index).
>
> I also beefed up the TestBackwardsCompatibility unit test: it now tests
> a non-CFS and a CFS index for each of the 1.9, 2.0, 2.1, and 2.2 index
> formats, and I added some term vector fields to these indices.
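As a rough illustration of what "regular" documents and a "congruent" field name->number mapping mean here, the sketch below models each segment's mapping as a list whose index is the field number. The class FieldCongruenceSketch and the method isCongruent are hypothetical names chosen for this example, and the check is an assumption about the general idea, not the test the patch performs.

```java
import java.util.List;

// Hypothetical sketch; field numbers are modeled as list indices.
final class FieldCongruenceSketch {

    /**
     * Returns true when every field number used by the segment refers to the
     * same field name in the merged mapping. Only then can bytes that embed
     * field numbers be copied without rewriting them.
     */
    static boolean isCongruent(List<String> mergedNames, List<String> segmentNames) {
        if (segmentNames.size() > mergedNames.size()) {
            return false;                   // segment uses numbers the merged mapping lacks
        }
        for (int number = 0; number < segmentNames.size(); number++) {
            if (!segmentNames.get(number).equals(mergedNames.get(number))) {
                return false;               // same number, different field name
            }
        }
        return true;
    }
}
```

For example, a segment whose fields were first seen in the order title, body is congruent with a merged mapping of title, body, date, while a segment that saw body before title is not and would have to be merged document by document.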