Rest handler redirect problem

Wong Lucent Mon, 15 Jun 2020 10:02:26 -0700

Hi,


Recently, our team upgraded our Flink cluster from version 1.7 to 1.10, and we 
ran into a problem when calling the Flink REST API.

1) We deploy our Flink cluster in standalone mode on Kubernetes and use two 
JobManagers for HA.

2) We deployed a Kubernetes service in front of the two JobManagers to provide 
a unified URL.

3) We use the REST API to operate the Flink cluster.

After upgrading to 1.10, we found that savepoint status queries are handled 
differently than in 1.7. For example, if we send a savepoint trigger request 
to the leader JobManager, in 1.7 we can query the standby JobManager to get 
the status of the savepoint, while in 1.10 the standby returns a 404 response.
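
To make the two calls concrete, here is a minimal sketch of the trigger and 
status requests against the Flink REST API. The paths and the "request-id" 
field follow the documented savepoint endpoints; the service URL in the usage 
note, the job id, and the target directory are placeholders, not values from 
our cluster.

```python
import json
import urllib.error
import urllib.request


def trigger_url(base_url, job_id):
    """URL for POST /jobs/:jobid/savepoints."""
    return f"{base_url.rstrip('/')}/jobs/{job_id}/savepoints"


def status_url(base_url, job_id, trigger_id):
    """URL for GET /jobs/:jobid/savepoints/:triggerid."""
    return f"{trigger_url(base_url, job_id)}/{trigger_id}"


def trigger_savepoint(base_url, job_id, target_dir):
    """Ask Flink to start a savepoint and return the trigger (request) id."""
    body = json.dumps({"target-directory": target_dir, "cancel-job": False})
    request = urllib.request.Request(
        trigger_url(base_url, job_id),
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["request-id"]


def savepoint_status(base_url, job_id, trigger_id):
    """Return (http_status, parsed_body); parsed_body is None on an HTTP error."""
    try:
        with urllib.request.urlopen(status_url(base_url, job_id, trigger_id)) as response:
            return response.status, json.load(response)
    except urllib.error.HTTPError as err:
        # This is where the 404 described above would surface.
        return err.code, None
```

With a Kubernetes service such as http://flink-jobmanager:8081 (a hypothetical 
name) in front of both JobManagers, the POST and the later GET may be routed 
to different pods, which is how the status query can end up on the standby.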

In 1.7, all requests to a standby JobManager are redirected to the leader by 
"RedirectHandler", while in 1.10 the requests are forwarded to the leader over 
RPC by "LeaderRetrievalHandler". But there seems to be an issue in 
"AbstractAsynchronousOperationHandlers": this handler keeps a local in-memory 
cache, "completedOperationCache", in which it stores the pending savepoint 
operation before forwarding the request to the leader JobManager, and this 
cache does not seem to be synced between the JobManagers. As a result, only 
the JobManager that received the savepoint trigger request can look up the 
status of the savepoint, while the others can only return 404.
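
The behavior we believe we are seeing can be illustrated with a toy model. 
This is not Flink's code; the class and method names are made up for the 
illustration, and it only assumes what is described above: each REST endpoint 
records triggered operations in its own process-local cache.

```python
class JobManagerRestEndpoint:
    """Toy stand-in for one JobManager's REST endpoint (hypothetical names)."""

    def __init__(self, name):
        self.name = name
        # Local to this process; nothing replicates it to the other JobManager.
        self.operation_cache = {}

    def trigger_savepoint(self, trigger_id):
        # Record the pending operation locally, then (in the real system)
        # forward the actual work to the leader over RPC.
        self.operation_cache[trigger_id] = "IN_PROGRESS"
        return 202

    def savepoint_status(self, trigger_id):
        # Only the endpoint that saw the trigger knows about the operation.
        if trigger_id not in self.operation_cache:
            return 404
        return 200


leader = JobManagerRestEndpoint("jobmanager-0")
standby = JobManagerRestEndpoint("jobmanager-1")

leader.trigger_savepoint("trigger-1")
leader_result = leader.savepoint_status("trigger-1")
standby_result = standby.savepoint_status("trigger-1")
```

In this model the leader answers the status query and the standby returns 
404, which matches what we observe through the Kubernetes service.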

As this breaks our design for operating the Flink cluster through the REST 
API, we can no longer use a Kubernetes service to hide the standby JobManager. 
We would like to know: is this behavior by design, or is it really a bug?


Thanks and Best Regards
Lucent Wong
