Hi,

ZooKeeper 3.5 introduced a startup check for snapshot files, which did not 
exist in 3.4. 3.5 won't start if data is present but there is no snapshot file.

https://issues.apache.org/jira/browse/ZOOKEEPER-3056 added an option to disable 
this check, to enable migration from 3.4 clusters. The workaround before then 
was to add an empty snapshot file to the dataDir.

As far as I can tell, the intended method of upgrading from 3.4 is to add 
snapshot.trust.empty=true to the ZooKeeper configuration, upgrade to 3.5.x, and 
remove the snapshot.trust.empty property once snapshots exist on all nodes.
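
To make "once snapshots exist on all nodes" concrete, here is a minimal sketch 
of a per-node check. It assumes the usual on-disk layout, where snapshots are 
files named snapshot.<zxid> under dataDir/version-2 (or dataLogDir, depending 
on configuration). The class and method names are hypothetical and not part of 
ZooKeeper.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of ZooKeeper.
final class SnapshotCheck {

    /**
     * Returns true if dataDir/version-2 contains at least one file whose
     * name starts with "snapshot." (assumed snapshot naming convention).
     */
    static boolean hasSnapshot(Path dataDir) throws IOException {
        Path versionDir = dataDir.resolve("version-2");
        if (!Files.isDirectory(versionDir)) {
            return false;
        }
        // The glob is matched against the file name of each directory entry.
        try (DirectoryStream<Path> entries =
                 Files.newDirectoryStream(versionDir, "snapshot.*")) {
            return entries.iterator().hasNext();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SnapshotCheck /path/to/dataDir
        Path dataDir = Paths.get(args[0]);
        System.out.println(hasSnapshot(dataDir)
            ? "snapshot present"
            : "no snapshot yet, keep snapshot.trust.empty=true");
    }
}
```

Running this against the data directory of each node would show when it is 
safe to remove the property.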

Sadly this method turns out to be inconvenient, as upgraded nodes will not 
write snapshots immediately. See 
https://issues.apache.org/jira/browse/ZOOKEEPER-3781?focusedCommentId=17261317&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17261317.

The reason some nodes may not write snapshots seems to be that when a new 
leader is elected, it may opt to send DIFF to the followers if they are not too 
far behind. If a follower receives a DIFF, it will not write a snapshot once 
NEWLEADER is received.

Is this snapshot write skipped for efficiency reasons, or to maintain 
correctness? If it is skipped only for efficiency, I think the upgrade 
experience could be improved by always writing a snapshot at 
https://github.com/apache/zookeeper/blob/eeb053767c9e931ae72a2d8c59c0940da3da9679/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Learner.java#L739-L741
if snapshot.trust.empty=true.
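
To illustrate, the proposed decision can be written as a small predicate. This 
is a simplified model of the behaviour described above, not the actual code in 
Learner.java; the class, method, and parameter names are hypothetical, and it 
assumes a snapshot is written whenever the follower did not receive a DIFF.

```java
// Hypothetical model of the decision made when NEWLEADER is received.
final class SyncSnapshotPolicy {

    /** Behaviour as described above: no snapshot after a DIFF sync. */
    static boolean currentBehaviour(boolean receivedDiff) {
        return !receivedDiff;
    }

    /**
     * Proposed behaviour: additionally write a snapshot after a DIFF
     * when snapshot.trust.empty=true, so upgraded nodes get one quickly.
     */
    static boolean proposedBehaviour(boolean receivedDiff,
                                     boolean trustEmptySnapshot) {
        return !receivedDiff || trustEmptySnapshot;
    }

    public static void main(String[] args) {
        // The only case that changes is DIFF with the property enabled.
        System.out.println(currentBehaviour(true));
        System.out.println(proposedBehaviour(true, true));
    }
}
```

With the property unset, the two predicates agree for every input, so clusters 
that are not migrating would see no change.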

This would allow people upgrading from 3.4.x to set snapshot.trust.empty=true, 
upgrade and boot the cluster, and remove the property again very shortly after 
the reboot. 

What do you think?
