Sebastien Pereira created FLINK-39086:
-----------------------------------------

             Summary: BlobServer returns unresolvable pod hostname in 
Kubernetes HA mode, breaking job submission (Regression from 1.20)
                 Key: FLINK-39086
                 URL: https://issues.apache.org/jira/browse/FLINK-39086
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes, Runtime / Coordination
    Affects Versions: 2.2.0
         Environment: * {*}Flink Version{*}: 2.2.0
 * {*}Flink Operator Version{*}: 1.13.0
 * {*}Deployment Mode{*}: Native Kubernetes with HA enabled
            Reporter: Sebastien Pereira


Job submission via Flink CLI within fails in Kubernetes HA mode because 
BlobServer returns short pod hostnames (e.g., {{{}flink-74b96b4f8-xc49r{}}}) 
instead of resolvable addresses. 
 * Job submission from inside the cluster fails with {{UnknownHostException}}
 * Affects FlinkDeployment CRs with HA enabled
 * Breaks CI/CD pipelines running in-cluster

h3. Suspected cause

FLINK-38109 [1] changed how BlobServer address resolution works in Flink 2.2. 
The new {{getBlobServerAddress()}} method [2] returns the pod's bind address 
(hostname), but in Kubernetes, short pod hostnames aren't DNS-resolvable 
without a headless service [3]. Flink 1.20 returned IP addresses which were 
directly resolvable.

The change [4] in {{JobSubmitHandler.java}} (lines 192-200) now uses 
{{gateway.getBlobServerAddress()}} instead of {{{}gateway.getHostname(){}}}.
h3. Reproduction

1/ Deploy Flink 2.2.0 with HA in Kubernetes:
 
{code}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-example
spec:
  image: flink:2.2.0
  flinkVersion: v2_2
  flinkConfiguration:
    high-availability.type: kubernetes
    high-availability.storageDir: file:///opt/flink/ha
  jobManager:
    resource:
      memory: "2048m"
  taskManager:
    resource:
      memory: "2048m"{code}
2/ Submit job from inside JobManager pod:
{code:sh}
kubectl exec -n <namespace> <jobmanager-pod> -- \
  flink run /opt/flink/examples/streaming/StateMachineExample.jar{code}
 {*}Result{*}: Fails with UnknownHostException: flink-74b96b4f8-xc49r
{code:java}
Caused by: java.io.IOException: Could not connect to BlobServer at address 
flink-74b96b4f8-xc49r/<unresolved>:6124 at 
org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:103) at 
org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$uploadJobGraphFiles$4(JobSubmitHandler.java:199)
 Caused by: java.net.UnknownHostException: flink-74b96b4f8-xc49r  {code}
The address flink-74b96b4f8-xc49r is a short pod hostname, not resolvable 
without FQDN suffix or additional Kubernetes configuration.
h3. Tested Workaround

Add a headless service. Example:
{code}
apiVersion: v1
kind: Service
metadata:
  name: flink-headless
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: flink
    component: jobmanager
  ports:
  - name: blob
    port: 6124
  - name: rpc
    port: 6123 {code}
 * resolves the DNS issue and allows job submission to work
 * Does not negatively impact deployments without HA enabled
 * Headless services are a standard Kubernetes pattern for enabling pod 
hostname resolution [5][6]

h3. Follow-up / questions
 # Is returning pod short hostnames an intended behavior?
 # Are there side effects or edge cases with the headless service approach that 
need consideration?
 # Are there other solutions besides headless service?

If headless service is an acceptable solution:
 * Should it be documented as a requirement for HA in Kubernetes (and where)?
 * Should it be created automatically by the Flink Kubernetes Operator?

h2. References

[1] FLINK-38109 - https://issues.apache.org/jira/browse/FLINK-38109

[2] Commit introducing the change - 
[https://github.com/apache/flink/commit/deee02b665c01d56d43da4df2eadfbbef333ec3c]

[3] Kubernetes DNS for Services and Pods - Pod hostname DNS resolution - 
[https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-hostname-and-subdomain-field]

[4] Affected code - 
[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobSubmitHandler.java#L192-L200]

[5] Kubernetes Documentation - Headless Services - 
[https://kubernetes.io/docs/concepts/services-networking/service/#headless-services]

[6] StatefulSets and Headless Services - 
[https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to