Adrien Grand created LUCENE-10217:
-------------------------------------
Summary: BufferedUpdates is memory inefficient
Key: LUCENE-10217
URL: https://issues.apache.org/jira/browse/LUCENE-10217
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
I recently got a question from [~David Turner] about why {{IndexWriter}} was
flushing data so frequently despite very small documents. After investigating,
we noticed that most of the RAM buffer was actually spent on BufferedUpdates
since his test was using {{IndexWriter#updateDocument}}. This is not surprising
given that BufferedUpdates accounts for BYTES_PER_DEL_TERM=160 bytes per update,
plus the length of the field and the length of the term, so often around 200
bytes just to record the updated term.
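
To make the arithmetic above concrete, here is a minimal sketch of the per-update estimate. The class and method names are hypothetical, and it assumes that the field name and the term each cost one byte per UTF-8 byte; the real accounting in BufferedUpdates may weigh them differently.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: a back-of-the-envelope estimate only.
final class BufferedUpdateEstimate {
    // Fixed per-update overhead quoted in this issue (BYTES_PER_DEL_TERM).
    static final long BYTES_PER_DEL_TERM = 160;

    // Assumption: field and term are each charged their UTF-8 length in bytes.
    static long estimateBytes(String field, String term) {
        return BYTES_PER_DEL_TERM
            + field.getBytes(StandardCharsets.UTF_8).length
            + term.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Hypothetical update keyed on a 36-character UUID in an "id" field.
        String term = "123e4567-e89b-12d3-a456-426614174000";
        System.out.println(estimateBytes("id", term) + " bytes per update");
    }
}
```

With a two-character field name and a 36-character identifier, the estimate lands just under the ~200 bytes mentioned above.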
As a comparison, Lucene's nightly NYC taxis benchmark only needs 286 bytes per
document in the RAM buffer for about 20 fields, or ~15 bytes per field
(http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#index_docs_per_mb_ram).
Updates are expected to be slower than appends given that they need to look
up terms in the dictionary, but I suspect that this memory inefficiency is
making updates even slower by forcing Lucene to flush its RAM buffer much more
frequently than it has to when purely appending documents.
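
As a rough illustration of the suspected effect on flush frequency, the sketch below compares how many documents fit in the RAM buffer before a flush. It assumes a 16 MB buffer (the IndexWriterConfig default), taxi-sized documents at 286 bytes each, and ~200 extra bytes per document when going through {{IndexWriter#updateDocument}}. The class and method names are hypothetical, and the model ignores every other consumer of the buffer.

```java
// Hypothetical helper, not part of Lucene: a back-of-the-envelope estimate only.
final class FlushFrequencyEstimate {
    // Number of whole documents that fit in the buffer before a flush is needed.
    static long docsPerFlush(long ramBufferBytes, long bytesPerDoc) {
        return ramBufferBytes / bytesPerDoc;
    }

    public static void main(String[] args) {
        long buffer = 16L * 1024 * 1024;   // assumed 16 MB RAM buffer
        long docBytes = 286;               // per-document cost from the taxis benchmark
        long updateBytes = 200;            // approximate BufferedUpdates cost per update

        long appendOnly = docsPerFlush(buffer, docBytes);
        long withUpdates = docsPerFlush(buffer, docBytes + updateBytes);

        System.out.println("append-only docs per flush: " + appendOnly);
        System.out.println("update docs per flush:      " + withUpdates);
    }
}
```

Under these assumptions the buffer fills about 1.7 times as fast with updates as with appends, and the gap widens for smaller documents, since the ~200 bytes per update do not shrink with the document.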