Sebastien Pereira created FLINK-39086:
-----------------------------------------
Summary: BlobServer returns unresolvable pod hostname in
Kubernetes HA mode, breaking job submission (Regression from 1.20)
Key: FLINK-39086
URL: https://issues.apache.org/jira/browse/FLINK-39086
Project: Flink
Issue Type: Bug
Components: Deployment / Kubernetes, Runtime / Coordination
Affects Versions: 2.2.0
Environment: * {*}Flink Version{*}: 2.2.0
* {*}Flink Operator Version{*}: 1.13.0
* {*}Deployment Mode{*}: Native Kubernetes with HA enabled
Reporter: Sebastien Pereira
Job submission via Flink CLI within fails in Kubernetes HA mode because
BlobServer returns short pod hostnames (e.g., {{{}flink-74b96b4f8-xc49r{}}})
instead of resolvable addresses.
* Job submission from inside the cluster fails with {{UnknownHostException}}
* Affects FlinkDeployment CRs with HA enabled
* Breaks CI/CD pipelines running in-cluster
h3. Suspected cause
FLINK-38109 [1] changed how BlobServer address resolution works in Flink 2.2.
The new {{getBlobServerAddress()}} method [2] returns the pod's bind address
(hostname), but in Kubernetes, short pod hostnames aren't DNS-resolvable
without a headless service [3]. Flink 1.20 returned IP addresses which were
directly resolvable.
The change [4] in {{JobSubmitHandler.java}} (lines 192-200) now uses
{{gateway.getBlobServerAddress()}} instead of {{{}gateway.getHostname(){}}}.
h3. Reproduction
1/ Deploy Flink 2.2.0 with HA in Kubernetes:
{code}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
name: flink-example
spec:
image: flink:2.2.0
flinkVersion: v2_2
flinkConfiguration:
high-availability.type: kubernetes
high-availability.storageDir: file:///opt/flink/ha
jobManager:
resource:
memory: "2048m"
taskManager:
resource:
memory: "2048m"{code}
2/ Submit job from inside JobManager pod:
{code:sh}
kubectl exec -n <namespace> <jobmanager-pod> -- \
flink run /opt/flink/examples/streaming/StateMachineExample.jar{code}
{*}Result{*}: Fails with UnknownHostException: flink-74b96b4f8-xc49r
{code:java}
Caused by: java.io.IOException: Could not connect to BlobServer at address
flink-74b96b4f8-xc49r/<unresolved>:6124 at
org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:103) at
org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$uploadJobGraphFiles$4(JobSubmitHandler.java:199)
Caused by: java.net.UnknownHostException: flink-74b96b4f8-xc49r {code}
The address flink-74b96b4f8-xc49r is a short pod hostname, not resolvable
without FQDN suffix or additional Kubernetes configuration.
h3. Tested Workaround
Add a headless service. Example:
{code}
apiVersion: v1
kind: Service
metadata:
name: flink-headless
spec:
clusterIP: None
publishNotReadyAddresses: true
selector:
app: flink
component: jobmanager
ports:
- name: blob
port: 6124
- name: rpc
port: 6123 {code}
* resolves the DNS issue and allows job submission to work
* Does not negatively impact deployments without HA enabled
* Headless services are a standard Kubernetes pattern for enabling pod
hostname resolution [5][6]
h3. Follow-up / questions
# Is returning pod short hostnames an intended behavior?
# Are there side effects or edge cases with the headless service approach that
need consideration?
# Are there other solutions besides headless service?
If headless service is an acceptable solution:
* Should it be documented as a requirement for HA in Kubernetes (and where)?
* Should it be created automatically by the Flink Kubernetes Operator?
h2. References
[1] FLINK-38109 - https://issues.apache.org/jira/browse/FLINK-38109
[2] Commit introducing the change -
[https://github.com/apache/flink/commit/deee02b665c01d56d43da4df2eadfbbef333ec3c]
[3] Kubernetes DNS for Services and Pods - Pod hostname DNS resolution -
[https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-hostname-and-subdomain-field]
[4] Affected code -
[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobSubmitHandler.java#L192-L200]
[5] Kubernetes Documentation - Headless Services -
[https://kubernetes.io/docs/concepts/services-networking/service/#headless-services]
[6] StatefulSets and Headless Services -
[https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)