[jira] [Comment Edited] (FLINK-10333) Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints)

TisonKun (JIRA) Sun, 09 Dec 2018 00:48:07 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713900#comment-16713900
 ]


TisonKun edited comment on FLINK-10333 at 12/9/18 8:47 AM:
-----------------------------------------------------------

Diving into this issue, it is quite tricky that we'd better list what problem 
we want to resolve and what questions we need to answer.
h3. *Problems on current ZooKeeper based store*

1. We don't have a good znode layout. What is suggested is documented at 
{{ZooKeeperHaServices}} but {{ZooKeeperCompletedCheckpointStore}} and others 
break it.
 2. We use Curator API to deal with writing ZooKeeper based store, it is lack 
of atomicity. Specifically, we check the leadership and write the znode 
non-atomically.
 3. We use {{RunningJobsRegistry}} to track if a job running by a jm. Ideally 
the termination of a job at jm, at dispatcher and the removal of the submitted 
jobgraph, these three action should happen atomically, but it isn't.
h3. *Questions about resolving the problems*

*1. How to layout znode?*
 Actually we could follow the document at {{ZooKeeperHaServices}}. That is, all 
job related nodes should be children of the job node, and so as cluster. For 
example, znode like "checkpoint/job_id/1 persistent" is better to change to 
"job_id/checkpoint/1 persistent".

*2. What actions we want to be atomic?*
 At least two. One is we check the leader ship and write a znode, for example, 
commit checkpoint only if JobManager is leader. The other is we terminate a job 
atomically at jm and dispatcher, but since jm and dispatcher are different 
components, it's important to deal the gap between jm finished the job and 
dispatcher commit it(specifically, remove submitted jobgraph). Also it is worth 
to consider how users see the status of a job on WebUI, depending on jm or 
dispatcher.

*3. How to achieve atomicity?*
 As discussion above, we can take use of ZooKeeper transaction. An alternative 
might be a distributed lock. This scope is mainly about the implementation.


was (Author: tison):
Diving into this issue, it is quite tricky that we'd better list what Problems 
we need to answer.
h3. *Problems on current ZooKeeper based store*

1. We don't have a good znode layout. What is suggested is documented at 
{{ZooKeeperHaServices}} but {{ZooKeeperCompletedCheckpointStore}} and others 
break it.
 2. We use Curator API to deal with writing ZooKeeper based store, it is lack 
of atomicity. Specifically, we check the leadership and write the znode 
non-atomically.
 3. We use {{RunningJobsRegistry}} to track if a job running by a jm. Ideally 
the termination of a job at jm, at dispatcher and the removal of the submitted 
jobgraph, these three action should happen atomically, but it isn't.
h3. *Questions about resolving the problems*

*1. How to layout znode?*
 Actually we could follow the document at {{ZooKeeperHaServices}}. That is, all 
job related nodes should be children of the job node, and so as cluster. For 
example, znode like "checkpoint/job_id/1 persistent" is better to change to 
"job_id/checkpoint/1 persistent".

*2. What actions we want to be atomic?*
 At least two. One is we check the leader ship and write a znode, for example, 
commit checkpoint only if JobManager is leader. The other is we terminate a job 
atomically at jm and dispatcher, but since jm and dispatcher are different 
components, it's important to deal the gap between jm finished the job and 
dispatcher commit it(specifically, remove submitted jobgraph). Also it is worth 
to consider how users see the status of a job on WebUI, depending on jm or 
dispatcher.

*3. How to achieve atomicity?*
 As discussion above, we can take use of ZooKeeper transaction. An alternative 
might be a distributed lock. This scope is mainly about the implementation.

> Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, 
> CompletedCheckpoints)
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-10333
>                 URL: https://issues.apache.org/jira/browse/FLINK-10333
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.3, 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Priority: Major
>             Fix For: 1.8.0
>
>
> While going over the ZooKeeper based stores 
> ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}}, 
> {{ZooKeeperCompletedCheckpointStore}}) and the underlying 
> {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were 
> introduced with past incremental changes.
> * Depending whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}} 
> or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization 
> problems will either lead to removing the Znode or not
> * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of 
> exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case 
> of a failure)
> * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be 
> better to move {{RetrievableStateStorageHelper}} out of it for a better 
> separation of concerns
> * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even 
> if it is locked. This should not happen since it could leave another system 
> in an inconsistent state (imagine a changed {{JobGraph}} which restores from 
> an old checkpoint)
> * Redundant but also somewhat inconsistent put logic in the different stores
> * Shadowing of ZooKeeper specific exceptions in {{ZooKeeperStateHandleStore}} 
> which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}}
> * Getting rid of the {{SubmittedJobGraphListener}} would be helpful
> These problems made me think how reliable these components actually work. 
> Since these components are very important, I propose to refactor them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (FLINK-10333) Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints)

Reply via email to