[
https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701495#comment-16701495
]
Xiaogang Shi commented on FLINK-10333:
--------------------------------------
Hi [~till.rohrmann] Our cluster is suffering from unstable Zookeeper
connections and I think this effort will help deal with some problems.
But we are still suffering from some problems in leader elections. The main
cause is due the lack of atomicity. For example, JobMaster will write its
address on another znode when it becomes the leader. But its leadership may
already be lost when it is going to write its address (e.g., due to a long-time
full GC). To alleviate the problem, many double checks are used in the code.
Similar problems are also observed in the access to checkpoints. When an old
job master loses its leadership, it may still have access to the checkpoints in
Zookeeper and may modify them. Various methods (including locks to disallow
deletion and rescanning zookeeper on restoring) are deployed to deal with these
exceptions, but it does not seem to be a perfect solution.
After diving deep into the implementation of leader election in Zookeeper
Recipes, i have some ideas to improve our implementation. The basic idea is
that we should guarantee that only the elected leader has the access to
Zookeeper. In Zookeeper Recipes, each leader contender will create an election
znode which is SEQUENTIAL and EPHERMAL under a certain path. The contender with
the smallest sequence number will be elected as the leader. When the elected
leader fails, its election znode will disappear and the contender whose session
number is smallest among the remaining contenders will be elected as the new
leader. So once a contender grants the leadership, its election znode must
exist in Zookeeper. Hence, we can record the election node of each contender.
Once each contender wants to modify something in Zookeeper, it must put these
modification, together with the check on the existence of its election node, in
a Zookeeper transaction. If the contender has already lost its leadership, the
transaction will fail due to the unsatisfied check. That way, we can ensure
only the elected leader has access to the states in Zookeeper.
Currently, Zookeeper Recipes does not expose any interface to access the path
of election nodes. Maybe we need to reimplement the leader election with native
Zookeeper interfaces in Flink without the usage of Zookeeper Recipes.
What do you think of the idea?
> Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker,
> CompletedCheckpoints)
> -------------------------------------------------------------------------------------
>
> Key: FLINK-10333
> URL: https://issues.apache.org/jira/browse/FLINK-10333
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.5.3, 1.6.0, 1.7.0
> Reporter: Till Rohrmann
> Priority: Major
> Fix For: 1.8.0
>
>
> While going over the ZooKeeper based stores
> ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}},
> {{ZooKeeperCompletedCheckpointStore}}) and the underlying
> {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were
> introduced with past incremental changes.
> * Depending whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}}
> or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization
> problems will either lead to removing the Znode or not
> * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of
> exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case
> of a failure)
> * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be
> better to move {{RetrievableStateStorageHelper}} out of it for a better
> separation of concerns
> * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even
> if it is locked. This should not happen since it could leave another system
> in an inconsistent state (imagine a changed {{JobGraph}} which restores from
> an old checkpoint)
> * Redundant but also somewhat inconsistent put logic in the different stores
> * Shadowing of ZooKeeper specific exceptions in {{ZooKeeperStateHandleStore}}
> which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}}
> * Getting rid of the {{SubmittedJobGraphListener}} would be helpful
> These problems made me think how reliable these components actually work.
> Since these components are very important, I propose to refactor them.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)