[jira] [Commented] (LUCENE-5248) Improve the data structure used in ReaderAndLiveDocs to hold the updates

Shai Erera (JIRA) Wed, 09 Oct 2013 20:27:45 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791146#comment-13791146
 ]

Shai Erera commented on LUCENE-5248:
------------------------------------

bq. should UpdatesIterator implement DISI? It seems like it might be a good fit.

I thought about it but decided not to since e.g. advance() is never going to be 
called. Because we need to pass a value for _all_ documents (to 
FieldsConsumer), we will always call nextDoc(). Do you have a use case in mind?

bq. When we have multiple related structures like this, maybe we can add a 
comment as to what each is?

I will.

bq. is bitset(maxdoc) really needed since usually its sparse? why not an 
openbitset parallel with "docs"?

I like the idea. So instead of calling docsWithField.set(doc), I will call 
docsWithField.set(size), and then sort this one too. It will definitely save 
memory for small updates and waste some bits for large updates, but since the 
docs/values structures are bigger than it in these cases, I think that's fine.
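
As a rough illustration of the parallel-bitset idea (a sketch only, not the code from the patch): the bit is set at the entry's position in the docs/values arrays rather than at its docID, and every swap done while sorting moves the bit together with the entry. The class and method names are hypothetical, java.util.BitSet stands in for a Lucene bitset, plain arrays stand in for the packed structures, and passing null to mean "this update removes the value" is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch: a bitset parallel to the docs/values arrays.
final class ParallelUpdatesSketch {
  private int[] docs = new int[4];
  private long[] values = new long[4];
  // Indexed by entry position (0..size-1), NOT by docID, so it stays small
  // for small updates regardless of maxDoc.
  private final BitSet docsWithField = new BitSet();
  private int size;

  /** Appends an update; a null value (assumption) means the doc has no value. */
  void add(int doc, Long value) {
    if (size == docs.length) {
      docs = Arrays.copyOf(docs, size * 2);
      values = Arrays.copyOf(values, size * 2);
    }
    docs[size] = doc;
    if (value != null) {
      values[size] = value;
      docsWithField.set(size); // set(size), not set(doc)
    }
    size++;
  }

  /** Stable insertion sort by docID (chosen for brevity, not speed). */
  void sortByDoc() {
    for (int i = 1; i < size; i++) {
      for (int j = i; j > 0 && docs[j - 1] > docs[j]; j--) {
        swap(j - 1, j);
      }
    }
  }

  // Swaps all three parallel structures so the bit follows its entry.
  private void swap(int i, int j) {
    int d = docs[i]; docs[i] = docs[j]; docs[j] = d;
    long v = values[i]; values[i] = values[j]; values[j] = v;
    boolean bi = docsWithField.get(i);
    boolean bj = docsWithField.get(j);
    docsWithField.set(i, bj);
    docsWithField.set(j, bi);
  }

  int size() { return size; }
  int doc(int i) { return docs[i]; }
  long value(int i) { return values[i]; }
  boolean hasValue(int i) { return docsWithField.get(i); }
}
```

The cost is one bit per update entry instead of one bit per document in the segment, which is the trade-off described above.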

bq. do these really need to be absolute-encoded?

The docs are only non-negative integers, but we have no guarantee about the 
order in which they arrive. So if we encode deltas we may end up w/ negative 
numbers, and won't be able to set BPV to bitsRequired(maxDoc-1). After we sort 
the arrays (when getUpdates() is called), we can re-encode deltas only, but I 
don't know if it's worth it since the arrays will be GC'd if the segment is 
merging...
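
To make the encoding argument concrete, here is a small self-contained sketch. The bitsRequired helper below is written out locally for the example (it is meant to behave like a "bits needed for the largest value" function and is not copied from the patch), and maxDoc=1000 plus the docIDs 100 and 2 are made-up inputs. It shows that deltas computed in arrival order can be negative, while deltas computed after sorting cannot.

```java
import java.util.Arrays;

// Hypothetical sketch: absolute vs. delta encoding of docIDs.
final class DeltaEncodingSketch {
  /** Bits needed to store any value in [0, maxValue]; local helper for the example. */
  static int bitsRequired(long maxValue) {
    if (maxValue < 0) {
      throw new IllegalArgumentException("maxValue must be non-negative (got: " + maxValue + ")");
    }
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  /** deltas[0] = docs[0], deltas[i] = docs[i] - docs[i-1]. */
  static int[] deltas(int[] docs) {
    int[] out = new int[docs.length];
    int prev = 0;
    for (int i = 0; i < docs.length; i++) {
      out[i] = docs[i] - prev;
      prev = docs[i];
    }
    return out;
  }

  public static void main(String[] args) {
    int maxDoc = 1000;          // made-up segment size
    int[] arrival = {100, 2};   // docIDs in the order the updates arrived

    // Absolute docIDs always fit in bitsRequired(maxDoc - 1) bits.
    System.out.println(bitsRequired(maxDoc - 1));            // prints 10

    // Deltas in arrival order can go negative.
    System.out.println(Arrays.toString(deltas(arrival)));    // prints [100, -98]

    // After sorting, deltas are non-negative.
    int[] sorted = arrival.clone();
    Arrays.sort(sorted);
    System.out.println(Arrays.toString(deltas(sorted)));     // prints [2, 98]
  }
}
```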

bq. The clear should be unnecessary!

You're right! It's a leftover from a debugging session that I forgot to remove.

bq. Is this really a limitation?

The arrays aren't limited, so no. But because we sort the arrays and the 
Sorter interfaces currently only take integer indexes, in practice it is. Maybe 
we should one day change the Sorter interface to take long indexes?

BTW, if we do that and intend to support more than 2B entries, we should either 
use FixedBitSet for docsWithField, or move to another parallel PagedMutable so 
that it too can hold as many entries as the docs and values can.

bq. Can we just use Long.compare?

We don't have Long.compare in Java 1.6 as far as I checked. I will change it to 
a ternary check using < and >.
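
For reference, a ternary comparison of this kind could look like the sketch below. The class and method names are hypothetical and not taken from the patch. Unlike a subtraction-based comparison, it cannot overflow.

```java
// Hypothetical sketch: comparing two longs without Long.compare (added in Java 7).
final class LongCompareSketch {
  /** Returns -1, 0 or 1 depending on whether a is less than, equal to, or greater than b. */
  static int compareLongs(long a, long b) {
    return a < b ? -1 : (a > b ? 1 : 0);
  }

  public static void main(String[] args) {
    // (int) (a - b) would overflow for these inputs; the ternary does not.
    System.out.println(compareLongs(Long.MIN_VALUE, Long.MAX_VALUE)); // prints -1
    System.out.println(compareLongs(5L, 5L));                         // prints 0
    System.out.println(compareLongs(7L, -3L));                        // prints 1
  }
}
```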

> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5248
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5248
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-5248.patch, LUCENE-5248.patch, LUCENE-5248.patch, 
> LUCENE-5248.patch
>
>
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field, to all docs that were updated and their 
> values. This structure is updated when applyDeletes is called, and needs to 
> satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and termB, 
> in that order, and termA affects doc=100 and termB doc=2, then the updates 
> are applied in that order, meaning we cannot rely on updates coming in order.
> # Same document may be updated multiple times, either by same term (e.g. 
> several calls to IW.updateNDV) or by different terms. Last update wins.
> # Sequential read: when writing the updates to the Directory 
> (fieldsConsumer), we iterate on the docs in-order and for each one check if 
> it's updated and if not, pull its value from the current DV.
> # A single update may affect several million documents, so the structure 
> needs to be efficient w.r.t. memory consumption.
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document, to all the fields that it was updated in and 
> the updated value for each field. This is used by IW.commitMergedDeletes to 
> apply the updates that came in while the segment was merging. The 
> requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we visit a document once (currently) and so if we can, it's 
> better if we know all the fields in which it was updated. The updates are 
> applied to the merged ReaderAndLiveDocs (where they are stored in the first 
> structure mentioned above).
> Comments with proposals will follow next.
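
The requirements listed for the first structure (un-ordered writes, last update wins, sequential read) can be sketched as follows. This is only an illustration under stated assumptions, not the structure proposed in the issue: the class and method names are hypothetical, plain arrays stand in for both the update buffers and the current doc values, and every updated docID is assumed to be smaller than the number of documents.

```java
import java.util.Arrays;

// Hypothetical sketch: apply un-ordered updates in doc order, last update wins.
final class LastUpdateWinsSketch {
  /**
   * docs/values are parallel arrays in arrival order; current holds the
   * existing value of every document. Returns the value to write per document.
   */
  static long[] merge(int[] docs, long[] values, long[] current) {
    // Sort entry indexes by docID. Arrays.sort on objects is stable, so
    // entries for the same doc keep their arrival order.
    Integer[] order = new Integer[docs.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Integer.compare(docs[a], docs[b]));

    long[] merged = new long[current.length];
    int u = 0; // position in the sorted updates
    for (int doc = 0; doc < current.length; doc++) {
      long v = current[doc]; // not updated: keep the current value
      // Consume every update for this doc; the last one seen wins.
      while (u < order.length && docs[order[u]] == doc) {
        v = values[order[u]];
        u++;
      }
      merged[doc] = v;
    }
    return merged;
  }

  public static void main(String[] args) {
    int[] docs = {3, 1, 3};        // doc 3 is updated twice, out of order
    long[] values = {10, 20, 30};
    long[] current = {0, 1, 2, 3, 4};
    System.out.println(Arrays.toString(merge(docs, values, current)));
    // prints [0, 20, 2, 30, 4]
  }
}
```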

--
This message was sent by Atlassian JIRA
(v6.1#6144)
