[ https://issues.apache.org/jira/browse/LUCENE-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976777#comment-15976777 ]
Jim Ferenczi commented on LUCENE-7791:
--------------------------------------
Great catch indeed!
{quote}
Also, the check for the range of values for a given field is now based on the
original ID (e.g. "upto < size"), so flushing can now lose some values, even
without hitting AIOOBE.
{quote}
I think it's OK since "size" corresponds to the max docID we have in the buffer,
so we cannot lose values here, unless I am missing something?
So the only problem here is that we don't check whether the remapped doc ID is
within the capacity of the bitset.
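
For illustration, here is a minimal Java sketch of the missing check. It does not use Lucene's FixedBitSet or the writer code from the patch: TinyFixedBitSet, hasValueUnchecked and hasValueChecked are hypothetical names, and the remapped IDs are made-up values chosen so that one of them falls beyond a 64-bit capacity.

```java
final class SortedFlushSketch {

    /** Hypothetical stand-in for a fixed-capacity bitset that does no bounds checking of its own. */
    static final class TinyFixedBitSet {
        final long[] words;
        final int numBits;

        TinyFixedBitSet(int numBits) {
            this.numBits = numBits;
            this.words = new long[(numBits + 63) >>> 6];
        }

        void set(int index) {
            // For long shifts, Java only uses the low 6 bits of the shift count.
            words[index >>> 6] |= 1L << index;
        }

        boolean get(int index) {
            return (words[index >>> 6] & (1L << index)) != 0;
        }
    }

    /** Unchecked lookup: throws ArrayIndexOutOfBoundsException once the ID is past the last word. */
    static boolean hasValueUnchecked(TinyFixedBitSet docsWithField, int remappedDocID) {
        return docsWithField.get(remappedDocID);
    }

    /** Checked lookup: an ID at or beyond the capacity is treated as "no value". */
    static boolean hasValueChecked(TinyFixedBitSet docsWithField, int remappedDocID) {
        return remappedDocID < docsWithField.numBits && docsWithField.get(remappedDocID);
    }

    public static void main(String[] args) {
        // Capacity stays at 64 because the trailing documents never had a value.
        TinyFixedBitSet docsWithField = new TinyFixedBitSet(64);
        docsWithField.set(3);

        int[] remappedDocIDs = {3, 100}; // hypothetical; 100 is beyond the capacity
        for (int docID : remappedDocIDs) {
            System.out.println(docID + " -> " + hasValueChecked(docsWithField, docID));
        }
        try {
            hasValueUnchecked(docsWithField, 100);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed for doc 100");
        }
    }
}
```

The checked variant treats an out-of-capacity ID as a missing value, which matches the idea that documents past the end of the bitset never had a value for the field.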
{quote}
please check the changes from LUCENE-7579 to confirm that there are no additional
bugs in other flush-sorting writers.
{quote}
I did, and the other doc value sorters do not use a bitset to handle missing
values. I think we are safe with this patch.
The patch and the test both look good. This bug only appears in 6.x, since the
code is slightly different in 7.
I'll merge shortly unless [~mikemccand] has something to add here.
> AIOOBE on flush+sort
> --------------------
>
> Key: LUCENE-7791
> URL: https://issues.apache.org/jira/browse/LUCENE-7791
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 6.5
> Reporter: Przemysław Szeremiota
> Labels: patch
> Attachments: sortflush.patch, sortflush-test.patch
>
>
> On the released 6.5.0 version, a flush operation on a sorted index throws
> ArrayIndexOutOfBoundsException in NumericDocValuesWriter, NormValuesWriter and
> BinaryDocValuesWriter.
> The new SortedXXXIterators look up documents in FixedBitSets or
> PackedValues by remapped (sorted) document ID, without checking the
> BitSets/Values ranges, which are based on original document IDs. Meanwhile,
> FixedBitSets can be sparse not only between documents with fields, but
> also after the (originally) last document with a given field (because the
> writer's addValue() is not called for trailing documents without values for
> the field). So the remapped (sorted) IDs can have a different useful value
> range, and bounds checking should be done on the remapped ID, not the original ID.
> We were hit by this bug because our indexes are built from independent
> sources by partially updating fragments of documents, so there are always some
> documents without values in some fields.
> As I understand this bug, it shows up when:
> - maxDoc is greater than 64 (64 is the pre-allocated size of the writers'
> FixedBitSets)
> - some number of the last documents added have empty fields (so the
> FixedBitSet won't be reallocated to maxDoc)
> Also, the check for the range of values for a given field is now based on the
> original ID (e.g. "upto < size"), so flushing can now lose some values, even
> without hitting AIOOBE.
> I will attach a patch resolving the issue in some writers. For the other
> writers from LUCENE-7579, I am not sure whether they have similar bugs. The
> patch resolved our indexing issues; please check the changes from LUCENE-7579
> to confirm that there are no additional bugs in other flush-sorting writers.
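
The two trigger conditions listed in the description can be sketched in Java as follows. This is a simplified model and not Lucene's writer code: SparseWriterSketch, addValue and capacityInBits are hypothetical names, and the growth policy (grow to exactly the word that is needed) is an assumption made for brevity. Only the 64-bit pre-allocated size is taken from the report.

```java
import java.util.Arrays;

final class SparseWriterSketch {
    // Pre-allocated size mentioned in the report.
    static final int INITIAL_BITS = 64;

    private long[] words = new long[INITIAL_BITS >>> 6];

    /** Records that docID has a value; the bitset grows only when a value is actually added. */
    void addValue(int docID) {
        int word = docID >>> 6;
        if (word >= words.length) {
            words = Arrays.copyOf(words, word + 1); // assumed growth policy
        }
        words[word] |= 1L << docID;
    }

    /** Number of bits that can be addressed without running off the backing array. */
    int capacityInBits() {
        return words.length << 6;
    }

    public static void main(String[] args) {
        int maxDoc = 100; // hypothetical segment size, greater than 64

        // Only the first ten documents have a value; the trailing ones are empty.
        SparseWriterSketch sparse = new SparseWriterSketch();
        for (int docID = 0; docID < 10; docID++) {
            sparse.addValue(docID);
        }
        System.out.println("sparse capacity covers maxDoc: " + (sparse.capacityInBits() >= maxDoc));

        // The last document has a value, so the bitset was grown past maxDoc.
        SparseWriterSketch dense = new SparseWriterSketch();
        dense.addValue(maxDoc - 1);
        System.out.println("dense capacity covers maxDoc: " + (dense.capacityInBits() >= maxDoc));
    }
}
```

In the sparse case the capacity stays below maxDoc, so any lookup by an ID in the range between the capacity and maxDoc needs an explicit bounds check.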