[
https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710963#comment-16710963
]
TisonKun edited comment on FLINK-10333 at 12/6/18 11:49 AM:
------------------------------------------------------------
Hi [~StephanEwen] and [~till.rohrmann]
From an offline discussion with [~xiaogang.shi], we see that ZK has a transactional
mechanism with which we can ensure that only the leader writes to ZK.
Given this knowledge and the inconsistencies [~till.rohrmann] noticed, before
going into a reimplementation I surveyed the usage of ZK-based stores in
Flink. Ideally, exactly one role writes to a given znode.
There are four types of znodes that Flink writes to. Besides
{{SubmittedJobGraphStore}} written by {{Dispatcher}},
{{CompletedCheckpointStore}} written by {{JobMaster}} and {{MesosWorkerStore}}
written by {{MesosResourceManager}}, there is the {{RunningJobsRegistry}} that
also has a ZK based implementation.
All of the first three write to ZK under a heavy "store lock", but as
[~xiaogang.shi] pointed out, the writes still lack atomicity. With a solution
based on ZK transactions, for example, the current {{Dispatcher}} leader
issuing {{setData}} together with a {{check}} on the {{election-node-path}},
we can ensure atomicity while getting rid of the lock.
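To make the idea concrete, here is a minimal self-contained sketch that simulates ZK's multi-op (transaction) semantics in plain Python; the class and path names are illustrative, not Flink's or ZooKeeper's actual API. It shows how bundling a {{check}} on the election znode's version with a {{setData}} on the store znode atomically rejects writes from a stale leader:

```python
# Hypothetical sketch: an in-memory stand-in for ZK's multi-op semantics.
# Names (FakeZooKeeper, paths) are illustrative, not Flink's real API.

class ZNode:
    def __init__(self, data):
        self.data = data
        self.version = 0

class FakeZooKeeper:
    """Minimal simulation of a ZK ensemble supporting check + setData in one txn."""
    def __init__(self):
        self.nodes = {}

    def create(self, path, data):
        self.nodes[path] = ZNode(data)

    def set_data(self, path, data):
        node = self.nodes[path]
        node.data = data
        node.version += 1

    def multi(self, check_path, expected_version, write_path, data):
        # check + setData executed atomically: the write happens only if the
        # leader-election znode still has the version the caller observed.
        if self.nodes[check_path].version != expected_version:
            return False  # another leader took over; the whole txn is rejected
        self.set_data(write_path, data)
        return True

zk = FakeZooKeeper()
zk.create("/flink/leader/election-node-path", b"dispatcher-1")
zk.create("/flink/jobgraphs/job-42", b"graph-v1")

# The current leader reads the election node's version, then writes conditionally.
seen = zk.nodes["/flink/leader/election-node-path"].version
assert zk.multi("/flink/leader/election-node-path", seen,
                "/flink/jobgraphs/job-42", b"graph-v2")

# A new leader bumps the election node; the old leader's next write is rejected.
zk.set_data("/flink/leader/election-node-path", b"dispatcher-2")
assert not zk.multi("/flink/leader/election-node-path", seen,
                    "/flink/jobgraphs/job-42", b"graph-v3")
```

With real ZooKeeper the same shape is expressed as a versioned check op plus a setData op submitted in one multi request, so no separate "store lock" is needed.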
For the last one, {{RunningJobsRegistry}}, the situation is a bit more
complex. {{JobManagerRunner}} is responsible for {{setJobRunning}} and
{{setJobFinished}}, while {{Dispatcher}} is responsible for {{clearJob}}. This
violates the ideal of one role per znode. I also notice that the distinction
between the semantics of {{DONE}} and those of clearing the entry is ambiguous.
We might prevent the same job from being processed twice by an approach other
than checking an ephemeral {{DONE}} status. What if we replaced
{{setJobFinished}} with clearing the {{RunningJobsRegistry}}?
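A small self-contained sketch of the proposal (the class is a hypothetical in-memory stand-in, not Flink's actual {{RunningJobsRegistry}} implementation), contrasting today's mark-{{DONE}}-then-{{clearJob}} split with finishing-as-clearing:

```python
# Hypothetical sketch, not Flink's real code: contrasts marking a job DONE
# (leaving a stale znode for a second role to clear) with clearing directly.

class RunningJobsRegistry:
    """In-memory stand-in for the ZK-based registry, keyed by job id."""
    def __init__(self):
        self.status = {}  # job_id -> "RUNNING" | "DONE"

    def set_job_running(self, job_id):
        self.status[job_id] = "RUNNING"

    def set_job_finished(self, job_id):
        self.status[job_id] = "DONE"  # entry lingers until clearJob runs

    def clear_job(self, job_id):
        self.status.pop(job_id, None)  # deletes the entry outright

# Today: JobManagerRunner marks DONE, and Dispatcher must later clearJob,
# so two roles touch the same znode and a DONE entry can linger.
reg = RunningJobsRegistry()
reg.set_job_running("job-42")
reg.set_job_finished("job-42")
assert reg.status["job-42"] == "DONE"  # stale until Dispatcher clears it

# Proposal: finishing simply clears the entry, so one role owns the znode's
# whole lifecycle and there is no DONE-versus-cleared gap to reason about.
reg2 = RunningJobsRegistry()
reg2.set_job_running("job-42")
reg2.clear_job("job-42")
assert "job-42" not in reg2.status
```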
> Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker,
> CompletedCheckpoints)
> -------------------------------------------------------------------------------------
>
> Key: FLINK-10333
> URL: https://issues.apache.org/jira/browse/FLINK-10333
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.5.3, 1.6.0, 1.7.0
> Reporter: Till Rohrmann
> Priority: Major
> Fix For: 1.8.0
>
>
> While going over the ZooKeeper based stores
> ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}},
> {{ZooKeeperCompletedCheckpointStore}}) and the underlying
> {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were
> introduced with past incremental changes.
> * Depending on whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}}
> or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization
> problems will either lead to removing the Znode or not
> * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of
> exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case
> of a failure)
> * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be
> better to move {{RetrievableStateStorageHelper}} out of it for a better
> separation of concerns
> * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even
> if it is locked. This should not happen since it could leave another system
> in an inconsistent state (imagine a changed {{JobGraph}} which restores from
> an old checkpoint)
> * Redundant but also somewhat inconsistent put logic in the different stores
> * Shadowing of ZooKeeper specific exceptions in {{ZooKeeperStateHandleStore}}
> which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}}
> * Getting rid of the {{SubmittedJobGraphListener}} would be helpful
> These problems made me question how reliably these components actually work.
> Since these components are very important, I propose to refactor them.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)