[
https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953613#comment-16953613
]
Zili Chen commented on FLINK-10333:
-----------------------------------
Hi Till,
{{getLeaderStore()}} belongs to {{LeaderElectionService}}, not to
{{HighAvailabilityServices}}
{{getJobGraphStore(LeaderStore leaderStore)}} belongs to
{{HighAvailabilityServices}} because we have to pass {{LeaderStore}} to
construct leader store based {{JobGraphStore}}. Such leader store belongs to
{{LeaderElectionService}} created by
{{HighAvailabilityServices#getXxxLeaderElectionServices}} and we are unable to
get a leader store based {{JobGraphStore}} with {{getJobGraphStore()}}.
Maybe a branch implementing {{ZooKeeperHighAvailabilityServicesNG}} helps on
understanding.
> why we need to introduce the LeaderStore for the existing implementations
Hmm...it helps on a unique {{DefaultJobGraphStore}} instead of
{{Standalone|ZKJobGraphStore}}. But well, if the community try to add
{{ZooKeeperHighAvailabilityServicesNG}} as a new switch of ha mode, and the new
implementation to be compatible with {{JobGraphStore#releaseJob}} and so on, I
will try to change our internal commits to a compatible way with only this new
mode. Still, we need to add
HighAvailabilityServices#default getJobGraphStore(LeaderStore leaderStore) {
return getJobGraphStore(); }
and all {{LeaderElectionServices}} with
LeaderElectionServices#default getLeaderStore() { return null; }
Again, it is because we have to pass {{LeaderStore}} to construct leader store
based {{JobGraphStore}}. Such leader store belongs to {{LeaderElectionService}}
created by {{HighAvailabilityServices#getXxxLeaderElectionServices}} and we are
unable to get a leader store based {{JobGraphStore}} with
{{getJobGraphStore()}}.
> Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker,
> CompletedCheckpoints)
> -------------------------------------------------------------------------------------
>
> Key: FLINK-10333
> URL: https://issues.apache.org/jira/browse/FLINK-10333
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.5.3, 1.6.0, 1.7.0
> Reporter: Till Rohrmann
> Priority: Major
> Attachments: screenshot-1.png
>
>
> While going over the ZooKeeper based stores
> ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}},
> {{ZooKeeperCompletedCheckpointStore}}) and the underlying
> {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were
> introduced with past incremental changes.
> * Depending whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}}
> or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization
> problems will either lead to removing the Znode or not
> * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of
> exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case
> of a failure)
> * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be
> better to move {{RetrievableStateStorageHelper}} out of it for a better
> separation of concerns
> * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even
> if it is locked. This should not happen since it could leave another system
> in an inconsistent state (imagine a changed {{JobGraph}} which restores from
> an old checkpoint)
> * Redundant but also somewhat inconsistent put logic in the different stores
> * Shadowing of ZooKeeper specific exceptions in {{ZooKeeperStateHandleStore}}
> which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}}
> * Getting rid of the {{SubmittedJobGraphListener}} would be helpful
> These problems made me think how reliable these components actually work.
> Since these components are very important, I propose to refactor them.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)