[
https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713900#comment-16713900
]
TisonKun edited comment on FLINK-10333 at 12/9/18 8:47 AM:
-----------------------------------------------------------
Diving into this issue, it is quite tricky that we'd better list what problem
we want to resolve and what questions we need to answer.
h3. *Problems on current ZooKeeper based store*
1. We don't have a good znode layout. What is suggested is documented at
{{ZooKeeperHaServices}} but {{ZooKeeperCompletedCheckpointStore}} and others
break it.
2. We use Curator API to deal with writing ZooKeeper based store, it is lack
of atomicity. Specifically, we check the leadership and write the znode
non-atomically.
3. We use {{RunningJobsRegistry}} to track if a job running by a jm. Ideally
the termination of a job at jm, at dispatcher and the removal of the submitted
jobgraph, these three action should happen atomically, but it isn't.
h3. *Questions about resolving the problems*
*1. How to layout znode?*
Actually we could follow the document at {{ZooKeeperHaServices}}. That is, all
job related nodes should be children of the job node, and so as cluster. For
example, znode like "checkpoint/job_id/1 persistent" is better to change to
"job_id/checkpoint/1 persistent".
*2. What actions we want to be atomic?*
At least two. One is we check the leader ship and write a znode, for example,
commit checkpoint only if JobManager is leader. The other is we terminate a job
atomically at jm and dispatcher, but since jm and dispatcher are different
components, it's important to deal the gap between jm finished the job and
dispatcher commit it(specifically, remove submitted jobgraph). Also it is worth
to consider how users see the status of a job on WebUI, depending on jm or
dispatcher.
*3. How to achieve atomicity?*
As discussion above, we can take use of ZooKeeper transaction. An alternative
might be a distributed lock. This scope is mainly about the implementation.
was (Author: tison):
Diving into this issue, it is quite tricky that we'd better list what Problems
we need to answer.
h3. *Problems on current ZooKeeper based store*
1. We don't have a good znode layout. What is suggested is documented at
{{ZooKeeperHaServices}} but {{ZooKeeperCompletedCheckpointStore}} and others
break it.
2. We use Curator API to deal with writing ZooKeeper based store, it is lack
of atomicity. Specifically, we check the leadership and write the znode
non-atomically.
3. We use {{RunningJobsRegistry}} to track if a job running by a jm. Ideally
the termination of a job at jm, at dispatcher and the removal of the submitted
jobgraph, these three action should happen atomically, but it isn't.
h3. *Questions about resolving the problems*
*1. How to layout znode?*
Actually we could follow the document at {{ZooKeeperHaServices}}. That is, all
job related nodes should be children of the job node, and so as cluster. For
example, znode like "checkpoint/job_id/1 persistent" is better to change to
"job_id/checkpoint/1 persistent".
*2. What actions we want to be atomic?*
At least two. One is we check the leader ship and write a znode, for example,
commit checkpoint only if JobManager is leader. The other is we terminate a job
atomically at jm and dispatcher, but since jm and dispatcher are different
components, it's important to deal the gap between jm finished the job and
dispatcher commit it(specifically, remove submitted jobgraph). Also it is worth
to consider how users see the status of a job on WebUI, depending on jm or
dispatcher.
*3. How to achieve atomicity?*
As discussion above, we can take use of ZooKeeper transaction. An alternative
might be a distributed lock. This scope is mainly about the implementation.
> Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker,
> CompletedCheckpoints)
> -------------------------------------------------------------------------------------
>
> Key: FLINK-10333
> URL: https://issues.apache.org/jira/browse/FLINK-10333
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.5.3, 1.6.0, 1.7.0
> Reporter: Till Rohrmann
> Priority: Major
> Fix For: 1.8.0
>
>
> While going over the ZooKeeper based stores
> ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}},
> {{ZooKeeperCompletedCheckpointStore}}) and the underlying
> {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were
> introduced with past incremental changes.
> * Depending whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}}
> or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization
> problems will either lead to removing the Znode or not
> * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of
> exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case
> of a failure)
> * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be
> better to move {{RetrievableStateStorageHelper}} out of it for a better
> separation of concerns
> * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even
> if it is locked. This should not happen since it could leave another system
> in an inconsistent state (imagine a changed {{JobGraph}} which restores from
> an old checkpoint)
> * Redundant but also somewhat inconsistent put logic in the different stores
> * Shadowing of ZooKeeper specific exceptions in {{ZooKeeperStateHandleStore}}
> which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}}
> * Getting rid of the {{SubmittedJobGraphListener}} would be helpful
> These problems made me think how reliable these components actually work.
> Since these components are very important, I propose to refactor them.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)