[ https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710963#comment-16710963 ]
TisonKun commented on FLINK-10333: ---------------------------------- Hi [~StephanEwen] and [~till.rohrmann] With an offline discuss with [~xiaogang.shi] we see ZK has a transactional mechanism so that we can ensure only the leader writes ZK. Given this knowledge and the inconsistencies [~till.rohrmann] noticed, before go into reimplementation, I did a survey of the usage of ZK based stores in flink. Ideally there is exact one role who writes a specific znode. There are four types of znodes that flink writes. Besides {{SubmittedJobGraphStore}} written by {{Dispatcher}}, {{CompletedCheckpointStore}} written by {{JobMaster}} and {{MesosWorkerStore}} written by {{MesosResourceManager}}, there is the {{RunningJobsRegistry}} that also has a ZK based implementation. All of the first three write ZK with a heavy “store lock”, but as [~xiaogang.shi] pointing out, it still lacks of atomicity. And with the solution based on ZK transaction — for example, a current {{Dispatcher}} leader {{setData}} with {{check}} for {{election-node-path}} — we can ensure the atomicity while getting rid of the lock. For the last one, {{RunningJobsRegistry}}, situation becomes a bit more complex. {{JobManagerRunner}} is responsible for {{setJobRunning}} and {{setJobFinished}} and {{Dispatcher}} is responsible for {{clearJob}}. This is against the ideal that one role for one znode. Also I notice that the gap between the semantic of {{DONE}} and that of clear is ambiguous. {{JobSchedulingStatus}} becomes {{DONE}} only if an {{ArchivedExecutionGraph}} generated so that we can prevent the same job be processed by an approach other than check an ephemeral {{DONE}} status. What if we replace {{setJobFinished}} with clearing {{RunningJobsRegistry}}? > Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, > CompletedCheckpoints) > ------------------------------------------------------------------------------------- > > Key: FLINK-10333 > URL: https://issues.apache.org/jira/browse/FLINK-10333 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination > Affects Versions: 1.5.3, 1.6.0, 1.7.0 > Reporter: Till Rohrmann > Priority: Major > Fix For: 1.8.0 > > > While going over the ZooKeeper based stores > ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}}, > {{ZooKeeperCompletedCheckpointStore}}) and the underlying > {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were > introduced with past incremental changes. > * Depending whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}} > or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization > problems will either lead to removing the Znode or not > * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of > exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case > of a failure) > * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be > better to move {{RetrievableStateStorageHelper}} out of it for a better > separation of concerns > * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even > if it is locked. This should not happen since it could leave another system > in an inconsistent state (imagine a changed {{JobGraph}} which restores from > an old checkpoint) > * Redundant but also somewhat inconsistent put logic in the different stores > * Shadowing of ZooKeeper specific exceptions in {{ZooKeeperStateHandleStore}} > which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}} > * Getting rid of the {{SubmittedJobGraphListener}} would be helpful > These problems made me think how reliable these components actually work. > Since these components are very important, I propose to refactor them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)