[
https://issues.apache.org/jira/browse/LUCENE-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781924#comment-13781924
]
Shai Erera commented on LUCENE-5248:
------------------------------------
I discussed briefly the details of the first data structure with Adrien and
Rob, and here's a proposal:
* Conceptually hold an int[] and a long[] array (for docs and values
respectively), per field.
* When an update is applied to a document, we append an entry to the arrays, not
bothering to update an existing entry.
** E.g. if updates come to docs 1,2,1,2,3,1,2,3, then the arrays will hold:
*** docs: {{[1,2,1,2,3,1,2,3]}}
*** values: {{[5,4,1,3,5,6,2,9]}}
** So the result of the updates should be doc1=6, doc2=2 and doc3=9.
* In writeLiveDocs we stable-sort the two arrays together by doc ID, so that
updates to the same document keep their arrival order, and take the last value
of each document. The sort will yield:
** docs: {{[1,1,1,2,2,2,3,3]}}
** values: {{[5,1,6,4,3,2,5,9]}}
** The Iterator<Number> will then return the last value of each document, i.e.
6, 2 and 9 for docs 1, 2 and 3.
* To manage the data structure:
** Add a FieldUpdates class which holds the ints/longs, sorts them and provides
an iterator-like API, e.g. nextDoc()/nextValue(), which returns the last value
for each document.
** For docs, use PackedInts.getMutable (with
bitsPerValue=PackedInts.bitsRequired(maxDoc - 1))
** For values, use GrowableWriter
** In ReaderAndLiveDocs, hold a Map<String,FieldUpdates>: a per-field
FieldUpdates instance.
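To make the proposal above concrete, here is a minimal sketch of the append, stable-sort and last-value-wins steps. It is only an illustration under assumptions: the class name FieldUpdatesSketch and the methods add()/resolve() are hypothetical, plain int[]/long[] stand in for the PackedInts mutable and GrowableWriter, and the sort goes through a boxed index array simply because Arrays.sort on objects is guaranteed stable. A real implementation would likely sort the packed arrays in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the proposed per-field structure; not Lucene code. */
final class FieldUpdatesSketch {
    private int[] docs = new int[8];     // stand-in for a PackedInts mutable
    private long[] values = new long[8]; // stand-in for a GrowableWriter
    private int size = 0;

    /** Appends an update; an earlier entry for the same doc is left in place. */
    void add(int doc, long value) {
        if (size == docs.length) {
            docs = Arrays.copyOf(docs, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        docs[size] = doc;
        values[size] = value;
        size++;
    }

    /** Returns {doc, value} pairs in doc order, keeping the last update per doc. */
    List<long[]> resolve() {
        // Sort entry indices by doc. Arrays.sort on objects is stable, so
        // entries for the same doc stay in arrival order.
        Integer[] order = new Integer[size];
        for (int i = 0; i < size; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Integer.compare(docs[a], docs[b]));

        List<long[]> out = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            int idx = order[i];
            // The last entry of each run of equal docs is the winning update.
            boolean lastOfRun = i == size - 1 || docs[order[i + 1]] != docs[idx];
            if (lastOfRun) {
                out.add(new long[] {docs[idx], values[idx]});
            }
        }
        return out;
    }
}
```

Feeding it the updates from the example (docs 1,2,1,2,3,1,2,3 with values 5,4,1,3,5,6,2,9) should resolve to doc1=6, doc2=2 and doc3=9.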
As for the second structure, it's unrelated to the first (i.e. they can be
improved separately), though it still suffers from the same issue: a single
update that comes in while the segment is merging can affect millions of
documents, so it can be inefficient too. One way to solve it is to use
the same structure mentioned above and manage multiple iterators in
IW.commitMergedDeletes (what's a bit more hair to this already hairy code? ;)),
so that for every document we "handle", we also iterate in parallel over all
the fields.
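A rough sketch of what "iterating in parallel" could look like: one cursor per field over already sorted, de-duplicated updates, all advanced while walking the documents in order. The names ParallelUpdateWalk, Cursor and walk() are hypothetical, and the sketch leaves out everything commitMergedDeletes really has to deal with (deleted docs, doc ID remapping into the merged segment); it only shows the one-pass, doc-ordered access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a one-pass walk over per-field update cursors. */
final class ParallelUpdateWalk {

    /** Cursor over one field's resolved updates; docs must be strictly increasing. */
    static final class Cursor {
        private final int[] docs;
        private final long[] values;
        private int pos = 0;

        Cursor(int[] docs, long[] values) {
            this.docs = docs;
            this.values = values;
        }

        boolean positionedOn(int doc) {
            return pos < docs.length && docs[pos] == doc;
        }

        long valueAndAdvance() {
            return values[pos++];
        }
    }

    /** Visits docs 0..maxDoc-1 once and reports every field updated for each doc. */
    static List<String> walk(int maxDoc, Map<String, Cursor> cursors) {
        List<String> seen = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            for (Map.Entry<String, Cursor> e : cursors.entrySet()) {
                Cursor c = e.getValue();
                if (c.positionedOn(doc)) {
                    // All fields updated for this doc are known in the same pass.
                    seen.add(doc + ":" + e.getKey() + "=" + c.valueAndAdvance());
                }
            }
        }
        return seen;
    }
}
```

Each cursor only moves forward, so no per-document map is needed; the cost is that every field's cursor is checked for every document visited.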
I'll start with the first structure, and then if it works well, I'll try to
apply it to the second. If you have comments/suggestions on other ways to store
the updates, feel free to propose them.
> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>
> Key: LUCENE-5248
> URL: https://issues.apache.org/jira/browse/LUCENE-5248
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Shai Erera
> Assignee: Shai Erera
>
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field, to all docs that were updated and their
> values. This structure is updated when applyDeletes is called, and needs to
> satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and termB,
> in that order, and termA affects doc=100 and termB doc=2, then the updates
> are applied in that order, meaning we cannot rely on updates coming in order.
> # The same document may be updated multiple times, either by the same term
> (e.g. several calls to IW.updateNDV) or by different terms. Last update wins.
> # Sequential read: when writing the updates to the Directory
> (fieldsConsumer), we iterate on the docs in-order and for each one check if
> it's updated and if not, pull its value from the current DV.
> # A single update may affect several million documents, therefore the
> structure needs to be efficient w.r.t. memory consumption.
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document, to all the fields that it was updated in and
> the updated value for each field. This is used by IW.commitMergedDeletes to
> apply the updates that came in while the segment was merging. The
> requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we visit a document once (currently) and so if we can, it's
> better if we know all the fields in which it was updated. The updates are
> applied to the merged ReaderAndLiveDocs (where they are stored in the first
> structure mentioned above).
> Comments with proposals will follow next.