Hi,
Recently, our team upgraded our Flink cluster from version 1.7 to 1.10, and we ran into a problem when calling the Flink REST API. Our setup is as follows:

1) We deploy our Flink cluster in standalone mode on Kubernetes and use two JobManagers for HA.
2) We deployed a Kubernetes service in front of the two JobManagers to provide a unified URL.
3) We use the REST API to operate the Flink cluster.

After upgrading to 1.10, we found that savepoint status queries are handled differently than in 1.7. For example, if we send a savepoint trigger request to the leader JobManager, in 1.7 we can query the standby JobManager to get the status of the savepoint, while in 1.10 the standby returns a 404 response.

In 1.7, all requests to the standby JobManager are forwarded to the leader in "RedirectHandler", while in 1.10 the requests are forwarded via RPC in "LeaderRetrievalHandler". But there seems to be an issue in "AbstractAsynchronousOperationHandlers": this handler keeps a local in-memory cache, "completedOperationCache", to store the pending savepoint operation before the request is forwarded to the leader JobManager, and this cache does not seem to be synced between the JobManagers. As a result, only the JobManager that received the savepoint trigger request can look up the status of the savepoint, while the others can only return 404.

This breaks our design for operating the Flink cluster through the REST API, because we can no longer use a Kubernetes service to hide the standby JobManager. We would like to know whether this behavior is by design or whether it is really a bug.

Thanks and Best Regards
Lucent Wong
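
P.S. For reference, the request sequence described above can be sketched roughly as follows. This is only a minimal illustration, not our actual tooling: the helper names are made up for this sketch, the JobManager addresses in the usage comment are hypothetical, and the endpoint paths and JSON fields ("request-id", "status" / "id") are the ones we understand the savepoint REST endpoints to use.

```python
import json
import urllib.error
import urllib.request


def savepoint_trigger_url(base_url, job_id):
    # POST target for triggering a savepoint of the given job.
    return f"{base_url.rstrip('/')}/jobs/{job_id}/savepoints"


def savepoint_status_url(base_url, job_id, trigger_id):
    # GET target for polling the status of a previously triggered savepoint.
    return f"{savepoint_trigger_url(base_url, job_id)}/{trigger_id}"


def trigger_savepoint(base_url, job_id, target_directory):
    # Sends the trigger request and returns the trigger id ("request-id").
    body = json.dumps(
        {"target-directory": target_directory, "cancel-job": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        savepoint_trigger_url(base_url, job_id),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["request-id"]


def savepoint_status(base_url, job_id, trigger_id):
    # Returns the status id (e.g. "IN_PROGRESS" or "COMPLETED"),
    # or None if this JobManager answers 404 for the trigger id.
    try:
        url = savepoint_status_url(base_url, job_id, trigger_id)
        with urllib.request.urlopen(url) as response:
            return json.load(response)["status"]["id"]
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None
        raise


# Hypothetical usage (addresses, job id and directory are placeholders):
#   trigger_id = trigger_savepoint("http://jobmanager-0:8081", job_id, "/savepoints")
#   savepoint_status("http://jobmanager-0:8081", job_id, trigger_id)  # status is found
#   savepoint_status("http://jobmanager-1:8081", job_id, trigger_id)  # None (404) on 1.10
```

With the Kubernetes service in front of both JobManagers, the trigger and the status query can land on different pods, which is how we end up in the second case.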