[
https://issues.apache.org/jira/browse/ZOOKEEPER-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784784#comment-13784784
]
Germán Blanco commented on ZOOKEEPER-1777:
------------------------------------------
I bet it is even more confusing for me :-). But I do have the logs from when I
reproduced the problem, so I will upload them. It was easy to reproduce anyway:
I just followed the steps above with an ensemble of three servers and arbitrary
transactions.
Forcing a snapshot in every synchronisation is not the only solution. It can
also be solved with a check that verifies that the followers have the same
history as the leader. Since synchronisation is the only point at which a
different history could join, a check of the last transaction should be enough.
The check could be done by comparing the entire transaction information or by
comparing a checksum. This information (transaction info or checksum) could be
sent from the follower to the leader at any time before the decision of whether
to synchronise using DIFF, TRUNC or SNAP, and the leader could then send a SNAP
if the checksum was wrong (and log a big WARN message).
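The leader-side decision described above could look roughly like the sketch
below. It is only an illustration under assumptions: the class, the method
names, the in-memory map standing in for the leader's transaction log and the
choice of CRC32 are all hypothetical and are not taken from the ZooKeeper code
base.

```java
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch; none of these names exist in ZooKeeper itself.
final class SyncHistoryCheck {

    enum SyncMode { DIFF, TRUNC, SNAP }

    /** CRC32 over the serialized bytes of a single transaction. */
    static long checksum(byte[] txnBytes) {
        CRC32 crc = new CRC32();
        crc.update(txnBytes, 0, txnBytes.length);
        return crc.getValue();
    }

    /**
     * Decides how to synchronise a follower.
     *
     * @param leaderLog            assumed view of the leader's history: zxid -> serialized txn
     * @param followerLastZxid     zxid of the last transaction reported by the follower
     * @param followerLastChecksum checksum of that transaction as computed by the follower
     * @param usualMode            mode the leader would have picked without this check
     */
    static SyncMode chooseSyncMode(Map<Long, byte[]> leaderLog,
                                   long followerLastZxid,
                                   long followerLastChecksum,
                                   SyncMode usualMode) {
        byte[] leaderTxn = leaderLog.get(followerLastZxid);
        if (leaderTxn == null) {
            // The leader cannot verify this zxid (assumption: the usual
            // DIFF/TRUNC/SNAP logic already handles this case).
            return usualMode;
        }
        if (checksum(leaderTxn) != followerLastChecksum) {
            // Same zxid but different content: the follower holds another history.
            System.err.printf(
                "WARN: follower history differs from leader at zxid 0x%x, forcing SNAP%n",
                followerLastZxid);
            return SyncMode.SNAP;
        }
        return usualMode;
    }
}
```

With a matching checksum the leader keeps whatever mode it would have chosen
anyway, so the extra check only changes behaviour when the histories diverge.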
This also covers the problem of an operator wrongly starting one of the members
of the ensemble with a data dir coming from another ensemble.
However, this does mean a small change in the protocol, which can be done
while keeping backwards compatibility. The leader would report that it is able
to optionally receive this information, and the follower would send it only if
the leader supports it.
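A minimal sketch of that opt-in handshake follows, assuming the leader
advertises its abilities as a bit mask. The flag name, the LastTxnInfo type and
the method are hypothetical and only illustrate the idea; they are not part of
the existing ZooKeeper protocol.

```java
import java.util.Optional;

// Hypothetical sketch of the backwards-compatible negotiation.
final class SyncCapabilityNegotiation {

    /** Assumed capability bit: "leader can receive last-transaction info". */
    static final int CAP_LAST_TXN_CHECK = 1;

    /** Information the follower may send about its last transaction. */
    static final class LastTxnInfo {
        final long zxid;
        final long checksum;

        LastTxnInfo(long zxid, long checksum) {
            this.zxid = zxid;
            this.checksum = checksum;
        }
    }

    /**
     * Follower side: send the extra information only when the leader has
     * advertised support. An old leader advertises nothing (0), so the
     * follower stays silent and the exchange is unchanged.
     */
    static Optional<LastTxnInfo> followerReply(int leaderCapabilities,
                                               long lastZxid,
                                               long lastChecksum) {
        if ((leaderCapabilities & CAP_LAST_TXN_CHECK) == 0) {
            return Optional.empty();
        }
        return Optional.of(new LastTxnInfo(lastZxid, lastChecksum));
    }
}
```

In the other direction, a new leader talking to an old follower simply never
receives the information and would have to fall back to its current behaviour.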
> Missing ephemeral nodes in one of the members of the ensemble
> -------------------------------------------------------------
>
> Key: ZOOKEEPER-1777
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1777
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.4.5
> Environment: Linux, Java 1.7
> Reporter: Germán Blanco
> Assignee: Germán Blanco
> Priority: Blocker
> Fix For: 3.4.6, 3.5.0
>
> Attachments: snaps.tar
>
>
> In a 3-server ensemble, one of the followers doesn't see some of the
> ephemeral nodes that are present in the leader and the other follower.
> The 8 nodes missing in "the follower that is not ok" were created at the end
> of epoch 1; the ensemble is now running in epoch 2.