[
https://issues.apache.org/jira/browse/ZOOKEEPER-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ZOOKEEPER-3249:
--------------------------------------
Labels: pull-request-available (was: )
> Avoid reverting the cversion and pzxid during replaying txns with fuzzy
> snapshot
> --------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3249
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3249
> Project: ZooKeeper
> Issue Type: Improvement
> Components: server
> Affects Versions: 3.6.0
> Reporter: Fangmin Lv
> Assignee: Fangmin Lv
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.6.0
>
>
> The only reason we need [the tricky hack
> code|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/DataTree.java#L1036-L1065]
> is the scenario below:
> If a child is deleted due to a session close and re-created in a different
> global session after the parent has been serialized, then during replay the
> node belongs to a different session, so replaying the closeSession txn won't
> delete it anymore, and we'll get a NODEEXISTS error when replaying the
> createNode txn. In this case, we need to update the cversion and pzxid to the
> new values with this tricky code.
> This could be solved in ZOOKEEPER-3145 with an explicit CloseSessionTxn. In
> theory, with that change, we don't need this kind of hack code anymore, but
> there is another case that could cause the cversion and pzxid to be
> reverted, and we still need to patch it. Here is the scenario:
> 1. Start taking a snapshot at T0
> 2. Txn T1 creates /P/N1, setting P's cversion and pzxid to (1, 1)
> 3. Txn T2 creates /P/N2, setting P's cversion and pzxid to (2, 2)
> 4. Txn T3 deletes /P/N1, setting P's pzxid to 3, which gives (2, 3)
> These states are all in the fuzzy snapshot.
> When loading the snapshot and txns during startup with the current code:
> 1. Replay T1: since /P/N1 does not exist, we overwrite P's cversion and
> pzxid with (1, 1)
> 2. Replay T2: the node already exists, so we go through the hack code to patch
> the cversion and pzxid, which become (2, 2)
> 3. Replay T3: set P's pzxid to 3, which gives (2, 3)
> The state ends up consistent thanks to the tricky patch code, but that code is
> error-prone and hacky, and we should remove it. To be able to remove it, in
> this patch we check the cversion first and avoid reverting the cversion and
> pzxid when replaying txns.
> We've also added metrics to verify that the hack logic is no longer active in
> prod. After that, I'll open another Jira to remove it and make the logic cleaner.
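
The replay scenario and the cversion check described above can be sketched
with a toy model. This is only an illustration under stated assumptions, not
the code from the patch: the class FuzzyReplaySketch, the Parent type, and the
replayCreate/replayDelete/replayScenario helpers are all hypothetical names,
and the unguarded variant deliberately leaves out the existing hack code so
that the revert becomes visible. Keeping the pzxid from moving backwards on
delete is likewise an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Set;

public class FuzzyReplaySketch {

    /** Toy stand-in for the parent znode /P: child names plus the two stat fields. */
    public static final class Parent {
        public final Set<String> children = new HashSet<>();
        public int cversion;
        public long pzxid;
    }

    /**
     * Replays a create txn against the parent.
     * Returns false when the child already exists (the NODEEXISTS case).
     * With guard == true, cversion and pzxid only ever move forward.
     */
    public static boolean replayCreate(Parent p, String child, int parentCVersion,
                                       long zxid, boolean guard) {
        if (!p.children.add(child)) {
            return false; // child was already present in the fuzzy snapshot
        }
        if (!guard || parentCVersion > p.cversion) {
            p.cversion = parentCVersion;
            p.pzxid = zxid;
        }
        return true;
    }

    /** Replays a delete txn; only the pzxid of the parent is touched. */
    public static void replayDelete(Parent p, String child, long zxid, boolean guard) {
        p.children.remove(child);
        // Assumption of this sketch: with the guard, pzxid never goes backwards.
        p.pzxid = guard ? Math.max(p.pzxid, zxid) : zxid;
    }

    /** Loads the fuzzy snapshot state from the scenario and replays T1..T3. */
    public static Parent replayScenario(boolean guard) {
        Parent p = new Parent();
        // Fuzzy snapshot: N1 already deleted, N2 present, P at (cversion=2, pzxid=3).
        p.children.add("N2");
        p.cversion = 2;
        p.pzxid = 3;

        replayCreate(p, "N1", 1, 1, guard); // T1: create /P/N1
        replayCreate(p, "N2", 2, 2, guard); // T2: create /P/N2 (already exists)
        replayDelete(p, "N1", 3, guard);    // T3: delete /P/N1
        return p;
    }

    public static void main(String[] args) {
        Parent guarded = replayScenario(true);
        Parent unguarded = replayScenario(false);
        System.out.println("guarded:   (" + guarded.cversion + ", " + guarded.pzxid + ")");
        System.out.println("unguarded: (" + unguarded.cversion + ", " + unguarded.pzxid + ")");
    }
}
```

In this model the guarded replay leaves P at (2, 3) without any special
handling of the NODEEXISTS case, while the unguarded replay (with no hack code
to repair it) ends with the cversion reverted to 1.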