[ https://issues.apache.org/jira/browse/LUCENE-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790834#comment-13790834 ]
Robert Muir commented on LUCENE-5248:
-------------------------------------
Hi Shai:
Should UpdatesIterator implement DISI (DocIdSetIterator)? It seems like it might be a good fit.
{code}
+ private final FixedBitSet docsWithField;
+ private PagedMutable docs;
+ private PagedGrowableWriter values;
{code}
When we have multiple related structures like this, maybe we can add a comment
explaining what each one is?
Something like:
{code}
// bit per docid: set if the value is "real"
// TODO: is bitset(maxdoc) really needed, since usually it's sparse? Why not an
// openbitset parallel with "docs"?
private final FixedBitSet docsWithField;
// holds a list of documents.
// TODO: do these really need to be absolute-encoded?
private PagedMutable docs;
// holds a list of values, parallel with docs
private PagedGrowableWriter values;
{code}
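To make the "parallel with docs" idea concrete, here is a minimal sketch using plain Java arrays instead of PagedMutable and PagedGrowableWriter. The class name `ParallelUpdates` and the method `sortedLastWins` are hypothetical and are not part of the patch. The sketch accepts updates in any order and, when read back, returns them in doc order with the last update per doc winning. It leaves out the docsWithField bitset entirely.

```java
import java.util.Arrays;

// Hypothetical illustration only: plain arrays stand in for the packed structures.
final class ParallelUpdates {
    private int[] docs = new int[4];      // doc IDs, in arrival order
    private long[] values = new long[4];  // values, parallel with docs
    private int size = 0;

    // Appends one update; docs may arrive out of order and may repeat.
    void add(int doc, long value) {
        if (size == docs.length) {
            docs = Arrays.copyOf(docs, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        docs[size] = doc;
        values[size] = value;
        size++;
    }

    // Returns {doc, value} pairs in increasing doc order, keeping only the
    // most recent value for each doc.
    long[][] sortedLastWins() {
        Integer[] order = new Integer[size];
        for (int i = 0; i < size; i++) {
            order[i] = i;
        }
        // Sorting an object array is stable, so entries for the same doc
        // stay in arrival order and the last one is the latest update.
        Arrays.sort(order, (a, b) -> Integer.compare(docs[a], docs[b]));

        long[][] out = new long[size][];
        int n = 0;
        for (int k = 0; k < size; k++) {
            int i = order[k];
            boolean lastForDoc = (k + 1 == size) || docs[order[k + 1]] != docs[i];
            if (lastForDoc) {
                out[n++] = new long[] {docs[i], values[i]};
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

For example, adding (100, 1), then (2, 5), then (100, 7) yields the pairs (2, 5) and (100, 7). The boxed index array is only for brevity; a real implementation would sort the packed structures in place.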
{code}
+ docsWithField = new FixedBitSet(maxDoc);
+ docsWithField.clear(0, maxDoc);
{code}
The clear should be unnecessary: a newly allocated FixedBitSet starts with all bits cleared.
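As a small illustration of why no explicit clear is needed, the sketch below assumes the bitset is backed by a freshly allocated long[] (the helpers `newBits` and `get` are hypothetical stand-ins, not Lucene API). Java zero-fills new arrays, so every bit reads as unset without any further work.

```java
// Hypothetical stand-in for a fixed-size bitset backed by long words.
final class FreshBitsDemo {
    // Allocates enough 64-bit words for numBits bits; Java zero-fills new arrays.
    static long[] newBits(int numBits) {
        return new long[(numBits + 63) >>> 6];
    }

    // Reads one bit; the shift count of a long is taken modulo 64.
    static boolean get(long[] bits, int index) {
        return (bits[index >> 6] & (1L << index)) != 0;
    }

    public static void main(String[] args) {
        long[] bits = newBits(1000);
        boolean anySet = false;
        for (int i = 0; i < 1000; i++) {
            anySet |= get(bits, i);
        }
        System.out.println(anySet); // prints false
    }
}
```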
{code}
+ public void add(int doc, Long value) {
+ assert value != null;
+ if (size == Integer.MAX_VALUE) {
+ throw new IllegalStateException("cannot support more than Integer.MAX_VALUE doc/value entries");
+ }
{code}
Is this really a limitation?
{code}
+ @Override
+ protected int compare(int i, int j) {
+ return (int) (docs.get(i) - docs.get(j));
+ }
{code}
Can we just use Long.compare? This subtraction may be safe, but Long.compare
would smell better.
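To show what can go wrong in general, here is a small sketch comparing the two approaches (the class and method names are hypothetical). For non-negative doc IDs that fit in an int the difference always fits in an int as well, which is why the subtraction is probably safe here. For arbitrary long values the cast can drop the high bits and report two different values as equal.

```java
final class CompareDemo {
    // Subtraction-based comparison, in the style of the patch excerpt.
    static int compareBySubtraction(long a, long b) {
        return (int) (a - b);
    }

    // Overflow-safe alternative from the JDK.
    static int compareSafely(long a, long b) {
        return Long.compare(a, b);
    }

    public static void main(String[] args) {
        long a = 1L << 32; // 4294967296: the low 32 bits are all zero
        long b = 0L;
        System.out.println(compareBySubtraction(a, b)); // prints 0, wrongly reporting "equal"
        System.out.println(compareSafely(a, b) > 0);    // prints true
    }
}
```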
> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>
> Key: LUCENE-5248
> URL: https://issues.apache.org/jira/browse/LUCENE-5248
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Shai Erera
> Assignee: Shai Erera
> Attachments: LUCENE-5248.patch, LUCENE-5248.patch, LUCENE-5248.patch,
> LUCENE-5248.patch
>
>
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field, to all docs that were updated and their
> values. This structure is updated when applyDeletes is called, and needs to
> satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and termB,
> in that order, and termA affects doc=100 and termB doc=2, then the updates
> are applied in that order, meaning we cannot rely on updates coming in order.
> # Same document may be updated multiple times, either by same term (e.g.
> several calls to IW.updateNDV) or by different terms. Last update wins.
> # Sequential read: when writing the updates to the Directory
> (fieldsConsumer), we iterate on the docs in-order and for each one check if
> it's updated and if not, pull its value from the current DV.
> # A single update may affect several million documents, therefore the
> structure needs to be efficient w.r.t. memory consumption.
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document, to all the fields that it was updated in and
> the updated value for each field. This is used by IW.commitMergedDeletes to
> apply the updates that came in while the segment was merging. The
> requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we visit a document once (currently) and so if we can, it's
> better if we know all the fields in which it was updated. The updates are
> applied to the merged ReaderAndLiveDocs (where they are stored in the first
> structure mentioned above).
> Comments with proposals will follow next.