[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486335 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------
Last, here are the results for small docs (100 tokens = ~550 bytes of plain text each):

2000000 DOCS @ ~550 bytes plain text
RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        2000000 docs in 886.7 secs
        index size = 438M

      new
        2000000 docs in 230.5 secs
        index size = 435M

      Total Docs/sec:             old  2255.6; new  8676.4 [  284.7% faster]
      Docs/MB @ flush:            old   128.0; new  4194.6 [ 3176.2% more]
      Avg RAM used (MB) @ flush:  old   107.3; new    37.7 [   64.9% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        2000000 docs in 888.7 secs
        index size = 438M

      new
        2000000 docs in 239.6 secs
        index size = 432M

      Total Docs/sec:             old  2250.5; new  8348.7 [  271.0% faster]
      Docs/MB @ flush:            old   128.0; new  4146.8 [ 3138.9% more]
      Avg RAM used (MB) @ flush:  old   108.1; new    38.9 [   64.0% less]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        2000000 docs in 1480.1 secs
        index size = 2.1G

      new
        2000000 docs in 462.0 secs
        index size = 2.1G

      Total Docs/sec:             old  1351.2; new  4329.3 [  220.4% faster]
      Docs/MB @ flush:            old    93.1; new  4194.6 [ 4405.7% more]
      Avg RAM used (MB) @ flush:  old   296.4; new    38.3 [   87.1% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        2000000 docs in 1489.4 secs
        index size = 2.1G

      new
        2000000 docs in 347.9 secs
        index size = 2.1G

      Total Docs/sec:             old  1342.8; new  5749.4 [  328.2% faster]
      Docs/MB @ flush:            old    93.1; new  4146.8 [ 4354.5% more]
      Avg RAM used (MB) @ flush:  old   297.1; new    38.6 [   87.0% less]

200000 DOCS @ ~5,500 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        200000 docs in 397.6 secs
        index size = 415M

      new
        200000 docs in 167.5 secs
        index size = 411M

      Total Docs/sec:             old   503.1; new  1194.1 [  137.3% faster]
      Docs/MB @ flush:            old    81.6; new   406.2 [  397.6% more]
      Avg RAM used (MB) @ flush:  old    87.3; new    35.2 [   59.7% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        200000 docs in 394.6 secs
        index size = 415M

      new
        200000 docs in 168.4 secs
        index size = 408M

      Total Docs/sec:             old   506.9; new  1187.7 [  134.3% faster]
      Docs/MB @ flush:            old    81.6; new   432.2 [  429.4% more]
      Avg RAM used (MB) @ flush:  old   126.6; new    36.9 [   70.8% less]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        200000 docs in 754.2 secs
        index size = 1.7G

      new
        200000 docs in 304.9 secs
        index size = 1.7G

      Total Docs/sec:             old   265.2; new   656.0 [  147.4% faster]
      Docs/MB @ flush:            old    46.7; new   406.2 [  769.6% more]
      Avg RAM used (MB) @ flush:  old    92.9; new    35.2 [   62.1% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        200000 docs in 743.9 secs
        index size = 1.7G

      new
        200000 docs in 244.3 secs
        index size = 1.7G

      Total Docs/sec:             old   268.9; new   818.7 [  204.5% faster]
      Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
      Avg RAM used (MB) @ flush:  old    93.0; new    36.6 [   60.6% less]
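
For anyone wanting to re-derive the bracketed percentages above, here is a minimal sketch of the arithmetic they appear to follow: "faster" and "more" read as (new / old - 1) * 100, and "less" as (1 - new / old) * 100. The class and method names (BenchDelta, percentGain, percentReduction, docsPerSec) are made up for illustration and are not part of the patch or of Lucene. Small differences in the last digit are expected because the printed inputs are already rounded.

```java
public class BenchDelta {
    // Percent by which `after` exceeds `before` (the "% faster" / "% more" figures).
    public static double percentGain(double before, double after) {
        return (after / before - 1.0) * 100.0;
    }

    // Percent by which `after` falls below `before` (the "% less" figures).
    public static double percentReduction(double before, double after) {
        return (1.0 - after / before) * 100.0;
    }

    // Throughput from a document count and an elapsed time in seconds.
    public static double docsPerSec(long docs, double secs) {
        return docs / secs;
    }

    public static void main(String[] args) {
        // Inputs are taken from the first result block
        // (2000000 docs, no term vectors, AUTOCOMMIT = true).
        System.out.printf("old docs/sec: %.1f%n", docsPerSec(2000000L, 886.7));
        System.out.printf("speedup: %.1f%% faster%n", percentGain(2255.6, 8676.4));
        System.out.printf("RAM: %.1f%% less%n", percentReduction(107.3, 37.7));
    }
}
```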
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
>
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
>
> The basic ideas are:
>
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>
>   * Recycle objects/buffers to reduce time/stress in GC.
>
>   * Other various optimizations.
>
> Some of these changes are similar to how KinoSearch builds a segment.
> But I haven't made any changes to Lucene's file format, nor added
> requirements for a global fields schema.
>
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
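
To illustrate the flush-by-RAM idea from the description above, here is a minimal, self-contained sketch. It is not the patch's code and does not use Lucene: RamBudgetBuffer and all of its methods are hypothetical, the "RAM used" figure is simply the UTF-8 byte length of each buffered document, and a flush just discards the buffer where a real writer would write a segment.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a writer that flushes on RAM usage
// rather than after a fixed number of added documents.
public class RamBudgetBuffer {
    private final long ramBudgetBytes;
    private final List<byte[]> pending = new ArrayList<>();
    private long bytesUsed = 0;
    private int flushCount = 0;

    public RamBudgetBuffer(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    // Buffer one document; flush as soon as the budget is reached.
    public void add(String doc) {
        byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
        pending.add(bytes);
        bytesUsed += bytes.length;
        if (bytesUsed >= ramBudgetBytes) {
            flush();
        }
    }

    // Assumption: a real implementation would write the buffered
    // documents to disk here. This sketch only resets the counters.
    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        bytesUsed = 0;
        flushCount++;
    }

    public int flushCount() {
        return flushCount;
    }

    public long bytesUsed() {
        return bytesUsed;
    }

    public int pendingDocs() {
        return pending.size();
    }
}
```

With a byte budget, the number of documents per flush follows document size: small documents pack many to a flush and large ones few, which a fixed document count cannot do.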