I think this is unintentional and simply a case we did not consider.

Please file a JIRA.

On 15/06/2020 19:01, Wong Lucent wrote:
Hi,


Recently, our team upgraded our Flink cluster from version 1.7 to 1.10, and we 
ran into a problem when calling the Flink REST API.

1) We deploy our Flink cluster in standalone mode on Kubernetes and use two 
JobManagers for HA.

2) We deployed a Kubernetes service in front of the two JobManagers to provide a 
unified URL.

3) We use the REST API to operate the Flink cluster.

After upgrading to 1.10, we found that savepoint status queries are handled 
differently than in 1.7. For example, if we send a savepoint trigger request to 
the leader JobManager, in 1.7 we can query the standby JobManager to get the 
status of the savepoint, while in 1.10 the standby returns a 404 response.
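
For illustration, the request sequence described above can be sketched in 
Python as follows. This is only a minimal sketch: the base URL, job ID and 
target directory are placeholders for our setup, and the helper names are 
made up for this example. The two endpoints are the savepoint trigger and 
savepoint status endpoints of the Flink REST API.

```python
import json
import urllib.request

# Assumption: address of the Kubernetes service in front of both JobManagers.
BASE_URL = "http://flink-jobmanager:8081"


def trigger_url(job_id):
    # POST here to trigger a savepoint; the response carries a "request-id".
    return f"{BASE_URL}/jobs/{job_id}/savepoints"


def status_url(job_id, trigger_id):
    # GET here to query the status of a previously triggered savepoint.
    return f"{BASE_URL}/jobs/{job_id}/savepoints/{trigger_id}"


def http_json(url, payload=None):
    # Sends a POST when a payload is given, otherwise a GET, and decodes JSON.
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def trigger_and_query(job_id, target_directory, send=http_json):
    # Step 1: trigger the savepoint and remember the returned request id.
    body = {"target-directory": target_directory, "cancel-job": False}
    trigger_id = send(trigger_url(job_id), body)["request-id"]
    # Step 2: query the status. Behind a Kubernetes service this request may
    # land on a different JobManager than the one that handled step 1.
    return send(status_url(job_id, trigger_id))
```

With a service in front of both JobManagers, step 2 is the call that 
returns 404 in 1.10 whenever it reaches the other JobManager.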

In 1.7, all requests to the standby JobManager are forwarded to the leader by "RedirectHandler", while in 
1.10 the requests are forwarded via RPC in "LeaderRetrievalHandler". But there seems to be an issue in 
"AbstractAsynchronousOperationHandlers": this handler keeps a local in-memory cache, 
"completedOperationCache", in which the pending savepoint operation is stored before the request is forwarded to the leader 
JobManager, and this cache does not seem to be synced between the JobManagers. As a result, only the JobManager that received the 
savepoint trigger request can look up the status of the savepoint, while the others can only return 404.
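
The effect can be reproduced with a toy model. The sketch below is not 
Flink's code: the class and method names are hypothetical, and the dict only 
stands in for a per-process cache such as "completedOperationCache".

```python
import uuid


class ToyJobManager:
    """Toy stand-in for a JobManager REST endpoint; not Flink's classes."""

    def __init__(self, name):
        self.name = name
        # Local, per-process cache: nothing replicates it to other instances.
        self.operation_cache = {}

    def trigger_savepoint(self):
        # The instance that receives the trigger records the operation locally.
        trigger_id = uuid.uuid4().hex
        self.operation_cache[trigger_id] = "IN_PROGRESS"
        return trigger_id

    def query_savepoint(self, trigger_id):
        # Only the local cache is consulted, so unknown ids yield a 404.
        if trigger_id not in self.operation_cache:
            return 404, None
        return 200, self.operation_cache[trigger_id]


leader = ToyJobManager("leader")
standby = ToyJobManager("standby")

trigger_id = leader.trigger_savepoint()
print(leader.query_savepoint(trigger_id))   # found in the leader's cache
print(standby.query_savepoint(trigger_id))  # not in the standby's cache
```

Whichever instance receives the trigger request is the only one that can 
answer the status query, which matches what we observe in 1.10.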

As this breaks our design for operating the Flink cluster through the REST API, we 
can no longer use a Kubernetes service to hide the standby JobManager. We would 
like to know whether this behavior is by design or whether it is really a bug.


Thanks and Best Regards
Lucent Wong

