[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483048#comment-15483048
 ] 

Rakesh R commented on ZOOKEEPER-2574:
-------------------------------------

Good catch, [~abhishekrai]. Thanks for the proposed patch with good unit 
testing.

I agree the following scenario will occur in learner side, where snapshotting 
has happened multiple times without accompanying log rollover, [refer source 
code|https://github.com/apache/zookeeper/blob/branch-3.4/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L439].
 I think there is mismatch in the zookeeper documentation, it says "This 
snapshot supercedes all previous logs", [please refer zk 
doc|https://zookeeper.apache.org/doc/r3.4.9/zookeeperAdmin.html#Ongoing+Data+Directory+Cleanup].
 But in this particular case the latest transactions after snapshotting is gong 
to the existing transaction log file, which is behind if we look at the 
naming(zxid part). I'm not sure whether this is intentionally implemented?. 
Again, iiuc, recovery will use a snapshot file + delta taken from the 
transaction log file name after snapshotted zxid, but in this case admins need 
to consider the log file just before, the snapshot file. Should we need to make 
the snapshot file and log file consistent?

[~phunt], any thoughts?

{code}
2. Following files exist:
log.100 spans transactions from zxid=100 till zxid=140 (inclusive)
snapshot.110 - snapshot as of zxid=110
snapshot.120 - snapshot as of zxid=120
snapshot.130 - snapshot as of zxid=130
{code}



> PurgeTxnLog can inadvertently delete required txn log files
> -----------------------------------------------------------
>
>                 Key: ZOOKEEPER-2574
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
>         Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
>            Reporter: Abhishek Rai
>            Assignee: Abhishek Rai
>            Priority: Blocker
>             Fix For: 3.4.10, 3.5.3
>
>         Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.3.patch, 
> ZOOKEEPER-2574.patch
>
>
> As part of the fix for ZOOKEEPER-1797, the call to 
> FileTxnSnapLog.getSnapshotLogs() was removed from PurgeTxnLog.java.  As a 
> result, some old-looking but required txn log files can be deleted, resulting 
> in data corruption or loss.
> For example, consider the following:
> 1. Configuration:
> autopurge.snapRetainCount=3
> 2. Following files exist:
> log.100 spans transactions from zxid=100 till zxid=140 (inclusive)
> snapshot.110 - snapshot as of zxid=110
> snapshot.120 - snapshot as of zxid=120
> snapshot.130 - snapshot as of zxid=130
> Above scenario is possible when snapshotting has happened multiple times but 
> without accompanying log rollover, which is possible if the server was 
> running as a learner.
> 3. PurgeTxnLog retains all snapshots but deletes log.100 because its zxid is 
> older than the zxid of the oldest snapshot (110).  This results in loss of 
> transactions in the range 131-140.
> Before the fix for ZOOKEEPER-1797, this was avoided by the call to 
> FileTxnSnapLog.getSnapshotLogs() which finds and retains the newest txn log 
> file with starting zxid < oldest retained snapshot's highest zxid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to