[ 
https://issues.apache.org/jira/browse/OAK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nuno Santos resolved OAK-10903.
-------------------------------
    Fix Version/s: 1.66.0
       Resolution: Done

> Indexing job: when parallel download is enabled, the dump may miss documents 
> updated during the download
> --------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-10903
>                 URL: https://issues.apache.org/jira/browse/OAK-10903
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: indexing
>            Reporter: Nuno Santos
>            Priority: Major
>             Fix For: 1.66.0
>
>
> There is a scenario in which the dump phase may miss some documents, so
> that the resulting index does not include them. The scenario is as
> follows:
>  - The download process creates a checkpoint at time t0. This means that we 
> need to download all documents that existed at time t0. Note that it's ok to 
> download documents that were created or modified after t0, as the 
> {{DocumentNodeStore.getNode(path, revision)}} will only consider the state as 
> it was at the moment of the checkpoint, so at t0.
>  - At the start of the download, the _modified values of the documents
> range from 10 to 90.
>  - The query downloading in descending order downloads the documents from 90
> down to 80.
>  - At this moment, the connection to Mongo is lost and the descending
> download thread tries to reconnect.
>  - During this period, a document D with _modified=70 is updated. The update 
> changes the _modified value to 93. Note that this document had not yet been 
> downloaded by the descending download thread.
>  - The descending download thread resumes downloading from 80 down. But this
> creates a new cursor in Mongo after D was updated to have _modified=93.
> Therefore, D will not be downloaded, even though it existed at time t0 and
> should be included.
> If the connection does not fail, Mongo uses a single cursor to traverse the
> repository, so it returns a consistent view of the repository as it existed
> at the start of the download. But if the connection fails, the new query
> creates a new cursor, which sees the state as it is when that second cursor
> is created. This state is no longer consistent with what the first query
> downloaded.
> Therefore, the following conditions are required for documents to be missed:
>  - Parallel download is enabled. When downloading only in ascending order,
> this problem does not occur: after a reconnection the download continues
> towards the highest _modified values, so all documents that were modified
> after the start of the download will still be downloaded. This will likely
> result in duplicates, but the merge sort phase of the indexing job
> eliminates duplicates, so it is not a problem.
>  - The descending download thread experiences connection failures.
>  - The instance is being actively updated during the download.
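
The scenario above can be reproduced with a small toy model. The sketch below is not Oak's downloader and does not use the MongoDB driver: `open_cursor` and `download` are hypothetical names, the store is a plain dict of document id to _modified, and a cursor is modelled as a snapshot taken when the query is issued, following the description in the issue. It assumes a single connection failure and a single update made while disconnected.

```python
import math


def open_cursor(store, low, high, descending):
    """Hypothetical cursor: snapshot of documents with low <= _modified <= high."""
    rows = [(modified, doc_id) for doc_id, modified in store.items()
            if low <= modified <= high]
    return sorted(rows, reverse=descending)


def download(store, descending, fail_at, update):
    """Download all documents, losing the connection once after reading `fail_at`.

    `update` is a (doc_id, new_modified) pair applied while disconnected.
    Returns the set of downloaded ids; using a set drops duplicates, standing
    in for the merge sort phase mentioned in the issue.
    """
    store = dict(store)  # copy, so the caller's documents are left untouched
    downloaded = []

    # First cursor covers the whole _modified range.
    for modified, doc_id in open_cursor(store, -math.inf, math.inf, descending):
        downloaded.append(doc_id)
        if modified == fail_at:
            break  # connection lost at this point

    # A document is updated while the thread is trying to reconnect.
    updated_id, new_modified = update
    store[updated_id] = new_modified

    # The resumed query opens a new cursor starting at the last value read.
    if descending:
        low, high = -math.inf, fail_at
    else:
        low, high = fail_at, math.inf
    for _, doc_id in open_cursor(store, low, high, descending):
        downloaded.append(doc_id)

    return set(downloaded)


docs = {"A": 90, "B": 80, "D": 70, "C": 10}

# Descending: D moves from 70 to 93 after 90..80 was read, so it is missed.
print(sorted(download(docs, descending=True, fail_at=80, update=("D", 93))))
# ['A', 'B', 'C']

# Ascending: the resumed query runs up to the highest value and still finds D.
print(sorted(download(docs, descending=False, fail_at=10, update=("D", 93))))
# ['A', 'B', 'C', 'D']
```

In this model the descending run loses D because the resumed query is bounded above by the last value read (80), while the ascending run is bounded only from below and therefore picks up any document whose _modified value increased.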



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
