[
https://issues.apache.org/jira/browse/FLINK-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhijiang updated FLINK-5893:
----------------------------
Description:
The map of {{JobManagerRegistration}}s in the ResourceManager is not thread-safe,
yet two threads may currently operate on it concurrently, producing unexpected
results.
The scenario is as follows:
- {{registerJobManager}}: When the job leader changes and the new JobManager
leader registers with the ResourceManager, the new {{JobManagerRegistration}}
replaces the old one in the map under the same {{JobID}} key. This process runs
on the rpc thread.
- Meanwhile, the {{JobLeaderIdService}} in the ResourceManager may notice the
job leader change and trigger {{jobLeaderLostLeadership}} on another thread.
This action removes the previous {{JobManagerRegistration}} from the map by
{{JobID}}, but that entry may already have been replaced by the new one from
{{registerJobManager}}.
In summary, this race condition may remove the new {{JobManagerRegistration}}
from the ResourceManager, resulting in an exception when a slot is later
requested from the ResourceManager.
As a solution, {{jobLeaderLostLeadership}} can be scheduled via {{runAsync}}
onto the rpc thread, so no extra lock is needed for the map.
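The proposed fix can be sketched as follows. This is a hypothetical, simplified model, not Flink's actual {{ResourceManager}} code: the class names, the {{String}}-based registrations, and the {{runAsync}} helper backed by a single-thread executor are all assumptions made for illustration. The point it demonstrates is that funneling both mutations onto one "main" thread serializes them, and that removing the stale entry only when it still matches the old value keeps a newer registration intact.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the runAsync approach; names do not match Flink's code.
public class RegistrationMapSketch {
    // Plain HashMap is fine because every access is serialized below.
    private final Map<String, String> registrations = new HashMap<>();
    // Stand-in for the rpc main thread: a single-thread executor.
    private final ExecutorService rpcMainThread = Executors.newSingleThreadExecutor();

    // Models registerJobManager: the new leader's registration replaces the old one.
    public void register(String jobId, String registration) {
        runAsync(() -> registrations.put(jobId, registration));
    }

    // Models jobLeaderLostLeadership: instead of mutating the map directly from
    // the leader-listener thread, the removal is rescheduled onto the rpc thread.
    public void leaderLost(String jobId, String oldRegistration) {
        runAsync(() ->
            // Conditional remove: a no-op if registerJobManager has already
            // replaced the stale entry, so the new registration survives.
            registrations.remove(jobId, oldRegistration));
    }

    // Stand-in for the rpc framework's runAsync scheduling primitive.
    private void runAsync(Runnable action) {
        rpcMainThread.execute(action);
    }

    // Reads also go through the rpc thread for a consistent view.
    public String get(String jobId) throws Exception {
        return rpcMainThread.submit(() -> registrations.get(jobId)).get();
    }

    public void shutdown() throws InterruptedException {
        rpcMainThread.shutdown();
        rpcMainThread.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        RegistrationMapSketch rm = new RegistrationMapSketch();
        rm.register("job-1", "leader-A");
        rm.register("job-1", "leader-B");   // new leader replaces the old one
        rm.leaderLost("job-1", "leader-A"); // stale removal is now a no-op
        System.out.println(rm.get("job-1")); // prints leader-B
        rm.shutdown();
    }
}
```

With the unconditional {{remove(jobId)}} that the race describes, the stale callback would delete the new leader's entry; serializing on one thread plus the value-checked {{Map.remove(key, value)}} avoids both the race and any extra lock.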
was:
(previous description, identical apart from {{ResourceManager}} markup)
> Race condition in removing previous JobManagerRegistration in ResourceManager
> -----------------------------------------------------------------------------
>
> Key: FLINK-5893
> URL: https://issues.apache.org/jira/browse/FLINK-5893
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager
> Reporter: zhijiang
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)