Márton Balassi created FLINK-26804:
--------------------------------------
Summary: Operator e2e tests sporadically fail: DEPLOYED_NOT_READY
Key: FLINK-26804
URL: https://issues.apache.org/jira/browse/FLINK-26804
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Reporter: Márton Balassi
Assignee: Márton Balassi
My fix for FLINK-26715 introduced a sporadic failure in the e2e tests. Because
the operator only checks on the job every couple of seconds, the job might
still be observed in the DEPLOYED_NOT_READY state even after it has
successfully completed checkpoints.
{code:bash}
Run ls e2e-tests/test_*.sh | while read script_test;do \
Running e2e-tests/test_kubernetes_application_ha.sh
persistentvolumeclaim/flink-example-statemachine created
Error from server (InternalError): error when creating
"e2e-tests/data/cr.yaml": Internal error occurred: failed calling webhook
"vflinkdeployments.flink.apache.org": failed to call webhook: Post
"https://flink-operator-webhook-service.default.svc:443/validate?timeout=10s":
dial tcp 10.106.63.26:443: connect: connection refused
Command: kubectl apply -f e2e-tests/data/cr.yaml failed. Retrying...
flinkdeployment.flink.apache.org/flink-example-statemachine created
persistentvolumeclaim/flink-example-statemachine unchanged
Error from server (NotFound): deployments.apps "flink-example-statemachine" not
found
Command: kubectl get deploy/flink-example-statemachine failed. Retrying...
NAME READY UP-TO-DATE AVAILABLE AGE
flink-example-statemachine 0/1 1 0 1s
deployment.apps/flink-example-statemachine condition met
Waiting for jobmanager pod flink-example-statemachine-7fcf55c88b-h5r7r ready.
pod/flink-example-statemachine-7fcf55c88b-h5r7r condition met
Waiting for log "Rest endpoint listening at"...
Log "Rest endpoint listening at" shows up.
Waiting for log "Completed checkpoint [0-9]+ for job"...
Log "Completed checkpoint [0-9]+ for job" shows up.
Successfully verified that
flinkdep/flink-example-statemachine.status.jobManagerDeploymentStatus is in
READY state.
Successfully verified that
flinkdep/flink-example-statemachine.status.jobStatus.state is in RUNNING state.
Kill the flink-example-statemachine-7fcf55c88b-h5r7r
Defaulted container "flink-main-container" out of: flink-main-container,
artifacts-fetcher (init)
Waiting for log "Restoring job 00000000000000000000000000000000 from
Checkpoint"...
Log "Restoring job 00000000000000000000000000000000 from Checkpoint" shows up.
Waiting for log "Completed checkpoint [0-9]+ for job"...
Log "Completed checkpoint [0-9]+ for job" shows up.
Status verification for
flinkdep/flink-example-statemachine.status.jobManagerDeploymentStatus failed.
It is DEPLOYED_NOT_READY instead of READY.
Debugging failed e2e test:
{code}
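One possible way to make the status check tolerate the operator's polling interval is to retry it until a deadline instead of checking once. The following is only a sketch under assumptions, not the fix that was applied: the helper name wait_for_status, its argument order, and the timeout and interval values are hypothetical, and the kubectl invocation in the usage comment assumes the status field path shown in the log above.

```shell
# Hypothetical helper: run a status-producing command repeatedly until it
# prints the expected value or the timeout (in seconds) has elapsed.
# Usage: wait_for_status EXPECTED TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_for_status() {
  local expected="$1" timeout="$2" interval="$3"
  shift 3
  local deadline actual
  deadline=$(( $(date +%s) + timeout ))
  actual=""
  while :; do
    # A failing command (e.g. resource not found yet) counts as "no status".
    actual="$("$@" 2>/dev/null || true)"
    if [ "$actual" = "$expected" ]; then
      echo "Successfully verified status is ${expected}."
      return 0
    fi
    # Give up once the deadline has passed; otherwise wait and retry.
    [ "$(date +%s)" -ge "$deadline" ] && break
    sleep "$interval"
  done
  echo "Status verification failed. It is ${actual} instead of ${expected}." >&2
  return 1
}

# Example (assumed values: 60 s timeout, 2 s polling interval):
# wait_for_status READY 60 2 \
#   kubectl get flinkdep/flink-example-statemachine \
#   -o jsonpath='{.status.jobManagerDeploymentStatus}'
```

Passing the status command as arguments keeps the helper independent of kubectl, so the same loop could cover both jobManagerDeploymentStatus and jobStatus.state. The timeout would need to be comfortably longer than the operator's reconcile interval.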