[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486332 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

A couple more details on the testing: I run java -server to get all
optimizations in the JVM, and the IO system is a local OS X RAID 0 of
4 SATA drives.

Using the above tool I ran an initial set of benchmarks comparing old
(= Lucene trunk) vs new (= this patch), varying document size (~550
bytes to ~5,500 bytes to ~55,000 bytes of plain text from Europarl
"en").

For each document size I run 4 combinations of whether term vectors
and stored fields are on or off and whether autoCommit is true or
false. I measure net docs/sec (= total # docs indexed divided by
total time taken), RAM efficiency (= avg # docs flushed with each
flush divided by RAM buffer size), and avg HEAP RAM usage before each
flush.

Here are the results for the 10K tokens (= ~55,000 bytes plain text)
per document:

  20000 DOCS @ ~55,000 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        20000 docs in 200.3 secs
        index size = 358M

      new
        20000 docs in 126.0 secs
        index size = 356M

      Total Docs/sec:             old    99.8; new   158.7 [  59.0% faster]
      Docs/MB @ flush:            old    24.2; new    49.1 [ 102.5% more]
      Avg RAM used (MB) @ flush:  old    74.5; new    36.2 [  51.4% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        20000 docs in 202.7 secs
        index size = 358M

      new
        20000 docs in 120.0 secs
        index size = 354M

      Total Docs/sec:             old    98.7; new   166.7 [  69.0% faster]
      Docs/MB @ flush:            old    24.2; new    48.9 [ 101.7% more]
      Avg RAM used (MB) @ flush:  old    74.3; new    37.0 [  50.2% less]


  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        20000 docs in 374.7 secs
        index size = 1.4G

      new
        20000 docs in 236.1 secs
        index size = 1.4G

      Total Docs/sec:             old    53.4; new    84.7 [  58.7% faster]
      Docs/MB @ flush:            old    10.2; new    49.1 [ 382.8% more]
      Avg RAM used (MB) @ flush:  old   129.3; new    36.6 [  71.7% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        20000 docs in 385.7 secs
        index size = 1.4G

      new
        20000 docs in 182.8 secs
        index size = 1.4G

      Total Docs/sec:             old    51.9; new   109.4 [ 111.0% faster]
      Docs/MB @ flush:            old    10.2; new    48.9 [ 380.9% more]
      Avg RAM used (MB) @ flush:  old    76.0; new    37.3 [  50.9% less]


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
>                      LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

-- 
This message is automatically generated by JIRA.
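
As a rough check on the bracketed percentages in the results above, the two derived figures (net docs/sec and relative change) can be recomputed from the reported totals. This is only a sketch: the helper names docs_per_sec and percent_change are made up for illustration and are not part of the benchmark tool. The inputs are the already-rounded numbers from the first result block (no term vectors, autoCommit = true), so the output agrees with the reported values only to within rounding.

```python
def docs_per_sec(total_docs, total_secs):
    """Net indexing rate: total docs indexed divided by total time taken."""
    return total_docs / total_secs


def percent_change(old, new):
    """Relative change from old to new, in percent (positive means new is larger)."""
    return (new - old) / old * 100.0


# Figures copied from the first result block (rounded as reported).
old_rate = docs_per_sec(20000, 200.3)
new_rate = docs_per_sec(20000, 126.0)

# "% faster" is the relative gain in docs/sec.
speedup = percent_change(old_rate, new_rate)

# "% less" is the relative drop in average RAM used at flush.
ram_saving = -percent_change(74.5, 36.2)

print(f"old {old_rate:.1f} docs/sec; new {new_rate:.1f} docs/sec")
print(f"{speedup:.1f}% faster, {ram_saving:.1f}% less RAM at flush")
```

The Docs/MB @ flush row is not recomputed here, because the average number of docs per flush is not listed in the message.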
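
The issue summary quoted in this message describes flushing according to RAM usage rather than after a fixed number of added documents. A toy model of that policy, assuming a simple byte count stands in for real RAM accounting, might look like the following. The class RamBoundedBuffer and all of its members are hypothetical and are not Lucene code; the patch's actual buffering of postings is far more involved.

```python
class RamBoundedBuffer:
    """Toy document buffer that flushes when a byte budget is reached.

    Hypothetical illustration only: real RAM accounting would track
    posting lists and term infos, not raw text length.
    """

    def __init__(self, ram_budget_bytes):
        self.ram_budget_bytes = ram_budget_bytes
        self.buffered = []          # documents waiting to be flushed
        self.bytes_used = 0         # estimated RAM held by the buffer
        self.flushed_segments = []  # one list of documents per flush

    def add_document(self, text):
        self.buffered.append(text)
        self.bytes_used += len(text.encode("utf-8"))
        # Flush is triggered by RAM usage, not by a document count.
        if self.bytes_used >= self.ram_budget_bytes:
            self.flush()

    def flush(self):
        if self.buffered:
            self.flushed_segments.append(list(self.buffered))
            self.buffered.clear()
            self.bytes_used = 0


buf = RamBoundedBuffer(ram_budget_bytes=1000)
for size in (400, 400, 400, 100, 900):
    buf.add_document("x" * size)
buf.flush()  # final flush; a no-op here because the buffer is already empty

segment_sizes = [len(seg) for seg in buf.flushed_segments]
print(segment_sizes)
```

With a byte budget, the number of documents per flush varies with document size, which is why the results report Docs/MB @ flush instead of a fixed docs-per-segment count.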