[ https://issues.apache.org/jira/browse/LUCENE-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless resolved LUCENE-1283.
----------------------------------------

    Resolution: Fixed

> Factor out ByteSliceWriter from DocumentsWriterFieldData
> --------------------------------------------------------
>
>                 Key: LUCENE-1283
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1283
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3, 2.3.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1283.patch
>
>
> DocumentsWriter uses byte slices into shared byte[]'s to hold the
> growing postings data for many different terms in memory.  This is
> probably the trickiest (most confusing) part of DocumentsWriter.
> Right now it's not cleanly factored out and not easy to separately
> test.  In working on this issue:
>
>   http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL PROTECTED]
>
> which eventually turned out to be a bug in Oracle JRE's JIT compiler,
> I factored out ByteSliceWriter and created a unit test to stress test
> the writing & reading of byte slices.  The test just randomly writes N
> streams interleaved into shared byte[]'s, then reads them back
> verifying the results are correct.
> I created the stress test to try to find any bugs in that code.  The
> test ran fine (no bugs were found) but I think the refactoring is
> still very much worthwhile.
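
For readers unfamiliar with the technique, here is a minimal sketch of the kind of stress test described above. It is not the code from LUCENE-1283.patch: the class and method names are hypothetical, and it assumes fixed-size slices whose last 4 bytes hold the offset of the next slice, which is simpler than what DocumentsWriter actually does.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: many streams share one byte[]; each stream is a chain of slices.
class ByteSliceSketch {
    static final int SLICE = 16;        // assumed fixed slice size, in bytes
    static final int DATA = SLICE - 4;  // payload bytes; the last 4 store the next slice's offset

    byte[] pool = new byte[1024];
    int used = 0;

    /** Reserves a new slice in the shared pool and returns its start offset. */
    int newSlice() {
        if (used + SLICE > pool.length) pool = Arrays.copyOf(pool, pool.length * 2);
        int start = used;
        used += SLICE;
        return start;
    }

    /** Write cursor for one stream. */
    final class SliceWriter {
        final int start = newSlice();
        int upto = start;
        int sliceEnd = start + DATA;
        int length = 0;

        void writeByte(byte b) {
            if (upto == sliceEnd) {
                // Current slice is full: chain a new one via a big-endian forwarding address.
                int next = newSlice();
                for (int i = 0; i < 4; i++) pool[sliceEnd + i] = (byte) (next >>> (24 - 8 * i));
                upto = next;
                sliceEnd = next + DATA;
            }
            pool[upto++] = b;
            length++;
        }
    }

    /** Reads back one stream by following its chain of slices. */
    byte[] read(int start, int length) {
        byte[] out = new byte[length];
        int upto = start;
        int sliceEnd = start + DATA;
        for (int n = 0; n < length; n++) {
            if (upto == sliceEnd) {
                int next = 0;
                for (int i = 0; i < 4; i++) next = (next << 8) | (pool[sliceEnd + i] & 0xFF);
                upto = next;
                sliceEnd = next + DATA;
            }
            out[n] = pool[upto++];
        }
        return out;
    }

    /** Writes random bytes to nStreams interleaved streams, then verifies every stream. */
    static boolean stress(long seed, int nStreams, int totalBytes) {
        Random random = new Random(seed);
        ByteSliceSketch slices = new ByteSliceSketch();
        SliceWriter[] writers = new SliceWriter[nStreams];
        ByteArrayOutputStream[] expected = new ByteArrayOutputStream[nStreams];
        for (int i = 0; i < nStreams; i++) {
            writers[i] = slices.new SliceWriter();
            expected[i] = new ByteArrayOutputStream();
        }
        for (int n = 0; n < totalBytes; n++) {
            int s = random.nextInt(nStreams);      // pick a stream at random
            byte b = (byte) random.nextInt(256);
            writers[s].writeByte(b);
            expected[s].write(b);
        }
        for (int i = 0; i < nStreams; i++) {
            byte[] actual = slices.read(writers[i].start, writers[i].length);
            if (!Arrays.equals(expected[i].toByteArray(), actual)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("streams verified: " + stress(42L, 10, 100000));
    }
}
```

Because the streams are written in random order, their slices end up scattered through the pool, so the read path has to follow the forwarding addresses correctly to reproduce each stream.
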
> I expected the changes to reduce indexing throughput, so I ran a test
> indexing the first 200K Wikipedia docs using this alg:
> {code}
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
> docs.file=/Volumes/External/lucene/wiki.txt
> doc.stored = true
> doc.term.vector = true
> doc.add.log.step=2000
> directory=FSDirectory
> autocommit=false
> compound=true
> ram.flush.mb=256
> { "Rounds"
>   ResetSystemErase
>   { "BuildIndex"
>     - CreateIndex
>     { "AddDocs" AddDoc > : 200000
>     - CloseIndex
>   }
>   NewRound
> } : 4
> RepSumByPrefRound BuildIndex
> {code}
> On trunk it produces these results:
> {code}
> Operation   round  runCnt  recsPerRun  rec/s  elapsedSec   avgUsedMem    avgTotalMem
> BuildIndex      0       1      200000  791.7      252.63  338,552,096  1,061,814,272
> BuildIndex      1       1      200000  793.1      252.18  605,262,080  1,061,814,272
> BuildIndex      2       1      200000  794.8      251.63  601,966,528  1,061,814,272
> BuildIndex      3       1      200000  782.5      255.58  608,699,712  1,061,814,272
> {code}
> and with the patch:
> {code}
> Operation   round  runCnt  recsPerRun  rec/s  elapsedSec   avgUsedMem    avgTotalMem
> BuildIndex      0       1      200000  745.0      268.47  338,318,784  1,061,814,272
> BuildIndex      1       1      200000  792.7      252.30  605,331,776  1,061,814,272
> BuildIndex      2       1      200000  786.7      254.24  602,915,712  1,061,814,272
> BuildIndex      3       1      200000  795.3      251.48  602,378,624  1,061,814,272
> {code}
> So it looks like the performance cost of this change is negligible (in
> the noise).
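
As a quick sanity check on the "in the noise" reading, the rec/s figures reported above can be averaged and compared. The sketch below is only arithmetic on those eight numbers; the class name is hypothetical, and leaving out round 0 as a possible warm-up outlier is an assumption, not something the report states.

```java
// Hypothetical helper: compares the mean rec/s reported for trunk and for the patch.
class ThroughputCheck {
    // rec/s per round, copied from the two result tables in the issue
    static final double[] TRUNK = {791.7, 793.1, 794.8, 782.5};
    static final double[] PATCH = {745.0, 792.7, 786.7, 795.3};

    /** Mean of xs[from..end). */
    static double mean(double[] xs, int from) {
        double sum = 0;
        for (int i = from; i < xs.length; i++) sum += xs[i];
        return sum / (xs.length - from);
    }

    /** Relative change of value against base, in percent. */
    static double pctChange(double base, double value) {
        return 100.0 * (value - base) / base;
    }

    public static void main(String[] args) {
        System.out.printf("all rounds: %+.2f%%%n", pctChange(mean(TRUNK, 0), mean(PATCH, 0)));
        // Assumption: round 0 of the patched run may be a warm-up outlier, so also compare without it.
        System.out.printf("rounds 1-3: %+.2f%%%n", pctChange(mean(TRUNK, 1), mean(PATCH, 1)));
    }
}
```

Across all four rounds the patched build averages roughly 1.3% fewer rec/s than trunk. With round 0 left out, it comes out about 0.2% ahead, which fits the conclusion that the difference is within run-to-run variation.
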