[
https://issues.apache.org/jira/browse/OAK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nuno Santos resolved OAK-10903.
-------------------------------
Fix Version/s: 1.66.0
Resolution: Done
> Indexing job: when parallel download is enabled, the dump may miss documents
> updated during the download
> --------------------------------------------------------------------------------------------------------
>
> Key: OAK-10903
> URL: https://issues.apache.org/jira/browse/OAK-10903
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: indexing
> Reporter: Nuno Santos
> Priority: Major
> Fix For: 1.66.0
>
>
> There is a scenario that may result in the dump phase missing some documents,
> meaning that the index would not include them. The scenario is as follows:
> - The download process creates a checkpoint at time t0. This means that we
> need to download all documents that existed at time t0. Note that it's ok to
> download documents that were created or modified after t0, as the
> {{DocumentNodeStore.getNode(path, revision)}} will only consider the state as
> it was at the moment of the checkpoint, so at t0.
> - At the start of the download, the _modified values in the repository
> range from 10 to 90.
> - The query downloading in descending order retrieves the documents from 90
> down to 80.
> - At this moment, the connection to Mongo is lost and the descending
> download thread tries to reconnect.
> - During this period, a document D with _modified=70 is updated. The update
> changes the _modified value to 93. Note that this document had not yet been
> downloaded by the descending download thread.
> - The descending download thread resumes downloading from 80 down. But this
> creates a new cursor in Mongo after D was updated to _modified=93, which is
> now above the range that the resumed query scans. Therefore, D will not be
> downloaded, even though it existed at time t0 and should be included.
> If the connection does not fail, Mongo uses a single cursor to traverse the
> repository, so it returns a consistent view of the repository as it existed
> at the start of the download. But if the connection fails, the new query
> creates a new cursor, which sees the repository as it is when that second
> cursor is created. That state is no longer consistent with what the first
> query downloaded.
> Therefore, the following conditions are required for documents to be missed:
> - Parallel download is enabled. If downloading only in ascending order, this
> problem will not occur, because after a reconnection the download continues
> towards the highest _modified values, so all documents that were modified
> after the start of the download will still be downloaded. This will likely
> result in duplicates, but the merge sort phase of the indexing job
> eliminates duplicates, so it is not a problem.
> - The descending download thread experiences connection failures.
> - The instance is being actively updated during the download.
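
The scenario above can be reproduced with a small toy model. This is only a
sketch and not Oak or MongoDB driver code: the function names (query,
download, update_d), the document ids, and the idea that a cursor is a
snapshot taken at creation time are all assumptions made for illustration,
following the description in this issue. It models one download thread per
run and does not model how the two threads split the range in parallel mode.

```python
# Toy model of the race described in this issue. Hypothetical names throughout.

def query(collection, descending, start=None):
    """Open a "cursor": a snapshot of matching docs ordered by _modified.

    Assumption: a cursor sees the collection as it was when it was created.
    """
    if start is None:
        matching = list(collection.items())
    elif descending:
        matching = [(k, m) for k, m in collection.items() if m <= start]
    else:
        matching = [(k, m) for k, m in collection.items() if m >= start]
    return sorted(matching, key=lambda item: item[1], reverse=descending)


def download(initial, descending, fail_at, mutate):
    """Download everything in one direction with one simulated reconnect."""
    collection = dict(initial)  # private copy so runs do not affect each other
    downloaded = set()
    last = None
    for doc_id, modified in query(collection, descending):
        downloaded.add(doc_id)
        last = modified
        if modified == fail_at:
            break  # connection lost right after this document
    mutate(collection)  # concurrent update while the thread is reconnecting
    # Resume from the last seen _modified value with a NEW cursor.
    for doc_id, modified in query(collection, descending, start=last):
        downloaded.add(doc_id)
    return downloaded


def update_d(collection):
    # Document D moves from _modified=70 to _modified=93.
    collection["D"] = 93


# State at checkpoint time t0: _modified values from 10 to 90, D at 70.
initial = {f"doc{m}": m for m in range(10, 100, 10) if m != 70}
initial["D"] = 70
expected = set(initial)  # every document that existed at t0

# Descending thread fails after 80, then resumes from 80 downwards.
missed_desc = expected - download(initial, True, 80, update_d)
# Ascending thread fails after 20, then resumes from 20 upwards.
missed_asc = expected - download(initial, False, 20, update_d)

print("missed when descending:", sorted(missed_desc))
print("missed when ascending:", sorted(missed_asc))
```

In this model the descending run never sees D, because the resumed query
only covers _modified <= 80, while the ascending run picks D up at its new
position (93). Resuming at the last seen value also re-downloads one
document, which corresponds to the duplicates that the merge sort phase
removes.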
--
This message was sent by Atlassian Jira
(v8.20.10#820010)