[
https://issues.apache.org/jira/browse/FLINK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702823#comment-17702823
]
Emmanuel Leroy commented on FLINK-31509:
----------------------------------------
[~bgeng777]
I agree that ideally the Service should only target the master; however, that is
not trivial in Kubernetes. The operator would need to scan for the master
periodically and update a label on the JM pods so that the Service targets only
the pod labeled as master.
The sessionAffinity solution is simple to implement (literally, it's a 1-liner
in the Service definition) and works fine for the purpose.
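For illustration, a minimal sketch of applying that one-liner after the Service has been created, using the Kubernetes Python client; the Service name ("my-session-cluster-rest"), namespace, and kubeconfig access are assumptions for the example, not anything the operator does today:

```python
from kubernetes import client, config

# Hypothetical sketch: patch the operator-created REST Service so that each
# client IP sticks to a single JobManager pod. Names are made up for the example.
config.load_kube_config()
core = client.CoreV1Api()

patch = {"spec": {"sessionAffinity": "ClientIP"}}
core.patch_namespaced_service(name="my-session-cluster-rest",
                              namespace="flink",
                              body=patch)
```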
> REST Service missing sessionAffinity causes job run failure with HA cluster
> ---------------------------------------------------------------------------
>
> Key: FLINK-31509
> URL: https://issues.apache.org/jira/browse/FLINK-31509
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Environment: Flink 1.15 on Flink Operator 1.4.0 on Kubernetes 1.25.4,
> (optionally with Beam 2.46.0)
> but the issue was observed on Flink 1.14, 1.15 and 1.16 and on Flink Operator
> 1.2, 1.3, 1.3.1, 1.4.0
>
> Reporter: Emmanuel Leroy
> Priority: Major
>
> When using a Session Cluster with multiple JobManagers, the -rest Service
> load balances API requests across all JobManagers, not just the master.
> When submitting a FlinkSessionJob, I often see errors like: `jar <jar_id>.jar
> was not found`, because the submission is done in 2 steps:
> * upload the jar with `v1/jars/upload` which returns the `jar_id`
> * run the job with `v1/jars/<jar_id>/run`
> Unfortunately, with the Service load balancing between nodes, it is often the
> case that the jar is uploaded to one JobManager while the run request is routed
> to another, where the jar does not exist.
> A simple fix is to set `sessionAffinity: ClientIP` on the -rest Service, so
> that API calls from a given originating IP are always routed to the same
> JobManager.
> This issue is especially problematic with Beam: the Beam job submission does
> not retry the run call with the existing jar_id, but fails, re-uploads a new
> jar and retries, until it is lucky enough to get the two calls in a row routed
> to the same JobManager.
>
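For context, a rough sketch of the two-step submission described in the issue above, made against the Flink REST API (the Service address, jar file, and entry class are made-up placeholders); without sessionAffinity the second call can land on a JobManager that never received the jar:

```python
import requests

# Hypothetical REST endpoint exposed by the <deployment>-rest Service.
REST = "http://my-session-cluster-rest.flink:8081"

# Step 1: upload the jar; the JobManager that receives it stores it locally
# and returns the path from which the jar id is derived.
with open("job.jar", "rb") as f:
    upload = requests.post(f"{REST}/v1/jars/upload", files={"jarfile": f})
upload.raise_for_status()
jar_id = upload.json()["filename"].split("/")[-1]

# Step 2: run the job by jar id. With several JobManagers behind the Service
# and no sessionAffinity, this request may reach a different pod, which then
# answers with "jar <jar_id>.jar was not found".
run = requests.post(f"{REST}/v1/jars/{jar_id}/run",
                    json={"entryClass": "com.example.Main"})
run.raise_for_status()
```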
--
This message was sent by Atlassian Jira
(v8.20.10#820010)