[
https://issues.apache.org/jira/browse/FLINK-32552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743096#comment-17743096
]
Fabio Wanner commented on FLINK-32552:
--------------------------------------
It's definitely not a bug in the operator.
I deployed 4 different jobs at the same time and observed the following:
JarRunHandler.handleRequest() is called 4 times, and the requests carry 4
distinct job IDs, one per job. However, the EmbeddedExecutor's
submitAndGetJobClientFuture() method (also called 4 times) sees only 3
distinct optJobIds plus one duplicate. This leads to different flavors of the
problem described in this ticket, depending on the exact timing of the
parallel job launches.
I will open a new bug ticket with the relevant information and link it here.
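One way to spot this condition in collected cluster logs is to group the
"Received JobGraph submission" entries by job ID and flag every ID that shows
up under more than one job name. The following Python sketch is only an
illustration, not part of Flink or the operator: the function names and the
regular expression are assumptions derived from the log excerpts quoted below,
and the second helper relies on the observation from the issue description
that a job ID begins with the first 16 hex digits of the FlinkSessionJob uid.

```python
import re
from collections import defaultdict

# Matches dispatcher lines such as:
#   Received JobGraph submission 'some_job_name' (<32 hex digits>).
# The pattern is an assumption based on the log excerpts in this ticket.
SUBMISSION_RE = re.compile(
    r"Received\s+JobGraph\s+submission\s+'(?P<name>[^']+)'\s+"
    r"\((?P<job_id>[0-9a-f]{32})\)"
)


def names_by_job_id(log_text):
    """Group the job names seen in submission log lines by job ID."""
    # Mail clients wrap long lines and add "> " quote markers; undo both so
    # that a submission entry spanning several lines can still be matched.
    flat = re.sub(r"\s*\n>?\s*", " ", log_text)
    seen = defaultdict(set)
    for match in SUBMISSION_RE.finditer(flat):
        seen[match.group("job_id")].add(match.group("name"))
    return seen


def mixed_up_job_ids(log_text):
    """Return only the job IDs that were submitted under more than one name."""
    return {
        job_id: sorted(names)
        for job_id, names in names_by_job_id(log_text).items()
        if len(names) > 1
    }


def job_id_matches_uid(job_id, uid):
    """Check whether a job ID starts with the first 16 hex digits of a uid."""
    return job_id.startswith(uid.replace("-", "")[:16])
```

For the values quoted in the issue description,
job_id_matches_uid("a7d36f3881f943a00000000000000002",
"a7d36f38-81f9-43a0-898f-19b950430e9d") holds, which ties the crash-looping
job ID to the aletsch_mat resource.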
> Mixed up Flink session job deployments
> --------------------------------------
>
> Key: FLINK-32552
> URL: https://issues.apache.org/jira/browse/FLINK-32552
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Fabio Wanner
> Priority: Major
>
> *Context*
> As part of our end-to-end tests, we regularly deploy all of our Flink
> session jobs to a staging environment. Some of the jobs are bundled together
> in one helm chart and are therefore deployed at the same time. There are
> around 40 individual Flink jobs (running on the same Flink session cluster).
> Each e2e test run gets its own session cluster. The problems described
> below occur rarely (in roughly 1 of 50 runs).
> *Problem*
> Rarely, the operator seems to "mix up" the deployments. This can be seen in
> the Flink cluster logs, where multiple
> {{Received JobGraph submission '<JOB NAME>' (<JOB_ID>)}} entries appear for
> different jobs with the same job ID. This results in errors such as
> {{DuplicateJobSubmissionException}} or {{ClassNotFoundException}}.
> It's also visible in the FlinkSessionJob resource: status.jobStatus.jobName
> does not match the expected name of the job being deployed (the job name
> is passed to the application as an argument).
> So far we have been unable to reliably reproduce the error.
> *Details*
> The following lines show the status of 3 jobs from the viewpoint of the
> Flink cluster dashboard and the FlinkSessionJob resource:
>
> *aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615*
> Apache Flink Dashboard:
> * State: Restarting
> * ID: a7d36f3881f943a00000000000000002
> * Exceptions: Cannot load user class:
> aelps.pipelines.aletsch.smc.SMCUrlMapper
> FlinkSessionJob Resource:
> * State: RUNNING
> * jobId: a1221c743367497b0000000000000002
> * uid: a1221c74-3367-497b-ad2f-8793ab23919d
>
> *aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615*
> Apache Flink Dashboard:
> * State: -
> * ID: -
> FlinkSessionJob Resource:
> * State: UPGRADING
> * jobId: -
> * uid: a7d36f38-81f9-43a0-898f-19b950430e9d
> Flink K8s Operator:
> * Exceptions: DuplicateJobSubmissionException: Job has already been
> submitted.
>
> *aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615*
> Apache Flink Dashboard:
> * State: Running
> * ID: e692c2dfaa18441c0000000000000002
> * Exceptions: -
> FlinkSessionJob Resource:
> * State: RUNNING
> * jobId: e692c2dfaa18441c0000000000000002
> * uid: e692c2df-aa18-441c-a352-88aefa9a3017
> As we can see, the *aletsch_smc* job is supposedly running according to the
> FlinkSessionJob resource, but it is crash-looping in the cluster, and its
> job ID there matches the uid of the *aletsch_mat* resource, while
> *aletsch_mat* is not running at all. The following logs also show some
> suspicious entries: there are several {{Received JobGraph submission}}
> entries from different jobs with the same job ID.
>
> *Logs*
> The logs are filtered by the 3 jobIds from above.
>
> JobID: a7d36f3881f943a00000000000000002
> {code:bash}
> Flink Cluster
> ...
> 2023-07-06 10:23:50,552 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
> 2023-07-06 10:23:50 file:
> '/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
> (valid JAR)
> 2023-07-06 10:23:50,522 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=4}]
> 2023-07-06 10:23:50,522 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=3}]
> 2023-07-06 10:23:50,522 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=2}]
> 2023-07-06 10:23:50,522 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=1}]
> 2023-07-06 10:23:50,512 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
> 2023-07-06 10:23:48,979 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Clearing resource requirements of job a7d36f3881f943a00000000000000002
> 2023-07-06 10:23:48,853 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=1}]
> 2023-07-06 10:23:48,853 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=2}]
> 2023-07-06 10:23:48,853 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=3}]
> 2023-07-06 10:23:48 file:
> '/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
> (valid JAR)
> 2023-07-06 10:23:48,661 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
> 2023-07-06 10:23:48,583 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=4}]
> 2023-07-06 10:23:48,583 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=3}]
> 2023-07-06 10:23:48,583 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=2}]
> 2023-07-06 10:23:48,582 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=1}]
> 2023-07-06 10:23:48,573 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
> 2023-07-06 10:23:47,562 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
> JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615'
> (a7d36f3881f943a00000000000000002).
> 2023-07-06 10:23:47,518 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Clearing resource requirements of job a7d36f3881f943a00000000000000002
> 2023-07-06 10:23:47,517 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=1}]
> 2023-07-06 10:23:47,517 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=2}]
> 2023-07-06 10:23:47,516 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=3}]
> 2023-07-06 10:23:47,463 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Submitting Job with JobId=a7d36f3881f943a00000000000000002.
> 2023-07-06 10:23:47,463 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Job a7d36f3881f943a00000000000000002 is submitted.
> 2023-07-06 10:23:47,104 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
> 2023-07-06 10:23:46,804 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Offer
> reserved slots to the leader of job a7d36f3881f943a00000000000000002.
> 2023-07-06 10:23:46,804 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Establish
> JobManager connection for job a7d36f3881f943a00000000000000002.
> 2023-07-06 10:23:46,799 INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful
> registration at job manager
> akka.tcp://[email protected]:6123/user/rpc/jobmanager_2 for job
> a7d36f3881f943a00000000000000002.
> 2023-07-06 10:23:46,577 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 221b24b50413805c9e35d7620b8a00b8 for job
> a7d36f3881f943a00000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:46,577 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 49d3c8cd1080bd38c0144c3d3cc597cd for job
> a7d36f3881f943a00000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:46,577 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 819f34cc8957066478fb4b3549367d24 for job
> a7d36f3881f943a00000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:46,574 INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job
> a7d36f3881f943a00000000000000002 for job leader monitoring.
> 2023-07-06 10:23:46,570 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 36802a7de1487f3fb1b6a3b509bd5e20 for job
> a7d36f3881f943a00000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:46,560 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a7d36f3881f943a00000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=4}]
> 2023-07-06 10:23:46,556 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager
> [email protected]://[email protected]:6123/user/rpc/jobmanager_2
> for job a7d36f3881f943a00000000000000002.
> 2023-07-06 10:23:46,528 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager
> [email protected]://[email protected]:6123/user/rpc/jobmanager_2
> for job a7d36f3881f943a00000000000000002.
> 2023-07-06 10:23:46,480 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a7d36f3881f943a00000000000000002) switched from state CREATED to RUNNING.
> 2023-07-06 10:23:46,476 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Starting
> execution of job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
> (a7d36f3881f943a00000000000000002) under job master id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:46,466 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Using
> failover strategy
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@62877000
> for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a7d36f3881f943a00000000000000002).
> 2023-07-06 10:23:46,079 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Running
> initialization on master for job
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a7d36f3881f943a00000000000000002).
> 2023-07-06 10:23:46,059 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
> Found 0 checkpoints in
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
> 2023-07-06 10:23:46,051 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
> Recovering checkpoints from
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
> 2023-07-06 10:23:46,006 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Using
> restart back off time strategy
> ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000,
> maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000,
> jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a7d36f3881f943a00000000000000002).
> 2023-07-06 10:23:45,987 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] -
> Initializing job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
> (a7d36f3881f943a00000000000000002).
> 2023-07-06 10:23:45,966 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
> JobGraph submission
> 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
> (a7d36f3881f943a00000000000000002).
> 2023-07-06 10:23:45,965 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
> JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615'
> (a7d36f3881f943a00000000000000002).
> 2023-07-06 10:23:45,915 INFO
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added
> JobGraph(jobId: a7d36f3881f943a00000000000000002) to
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
> 2023-07-06 10:23:45,859 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting
> job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
> (a7d36f3881f943a00000000000000002).
> 2023-07-06 10:23:45,857 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
> JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
> (a7d36f3881f943a00000000000000002).
> 2023-07-06 10:23:45,705 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Submitting Job with JobId=a7d36f3881f943a00000000000000002.
> 2023-07-06 10:23:45,705 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Job a7d36f3881f943a00000000000000002 is submitted.
> 2023-07-06 10:23:45,705 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Submitting Job with JobId=a7d36f3881f943a00000000000000002.
> 2023-07-06 10:23:45,705 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Job a7d36f3881f943a00000000000000002 is submitted.
> 2023-07-06 10:23:45,705 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Submitting Job with JobId=a7d36f3881f943a00000000000000002.
> 2023-07-06 10:23:45,705 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Job a7d36f3881f943a00000000000000002 is submitted.
> Flink Operator
> 2023-07-06 10:26:25,792 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
> a7d36f3881f943a00000000000000002 to session cluster.
> 2023-07-06 10:25:05,163 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
> a7d36f3881f943a00000000000000002 to session cluster.
> 2023-07-06 10:24:24,553 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
> a7d36f3881f943a00000000000000002 to session cluster.
> 2023-07-06 10:24:03,850 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
> a7d36f3881f943a00000000000000002 to session cluster.
> 2023-07-06 10:23:53,094 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
> a7d36f3881f943a00000000000000002 to session cluster.
> 2023-07-06 10:23:47,346 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
> a7d36f3881f943a00000000000000002 to session cluster.
> 2023-07-06 10:23:45,372 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
> a7d36f3881f943a00000000000000002 to session cluster.
> {code}
>
> JobID: a1221c743367497b0000000000000002
> {code:bash}
> Flink Cluster
> 2023-07-06 11:23:48,062 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 1 for job a1221c743367497b0000000000000002 (48548 bytes,
> checkpointDuration=107 ms, finalizationTime=33 ms).
> 2023-07-06 11:23:47,937 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 1 (type=CheckpointType{name='Checkpoint',
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427922 for job
> a1221c743367497b0000000000000002.
> 2023-07-06 10:23:48,567 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Offer
> reserved slots to the leader of job a1221c743367497b0000000000000002.
> 2023-07-06 10:23:48,567 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Establish
> JobManager connection for job a1221c743367497b0000000000000002.
> 2023-07-06 10:23:48,567 INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful
> registration at job manager
> akka.tcp://[email protected]:6123/user/rpc/jobmanager_7 for job
> a1221c743367497b0000000000000002.
> 2023-07-06 10:23:48,009 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request cae6932e2409d5fece3f6b4636e3c71a for job
> a1221c743367497b0000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:48,003 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 8a57f3ecff07d300aebb33f6b3545aed for job
> a1221c743367497b0000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:48,003 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 7a4a0cfd16eec4a1cb043cce5f989db0 for job
> a1221c743367497b0000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:48,002 INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job
> a1221c743367497b0000000000000002 for job leader monitoring.
> 2023-07-06 10:23:48,002 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 92cbc64513fa703e4acf28bbb3088a58 for job
> a1221c743367497b0000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:48,999 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> a1221c743367497b0000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=4}]
> 2023-07-06 10:23:47,998 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager
> [email protected]://[email protected]:6123/user/rpc/jobmanager_7
> for job a1221c743367497b0000000000000002.
> 2023-07-06 10:23:47,953 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager
> [email protected]://[email protected]:6123/user/rpc/jobmanager_7
> for job a1221c743367497b0000000000000002.
> 2023-07-06 10:23:47,922 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a1221c743367497b0000000000000002) switched from state CREATED to RUNNING.
> 2023-07-06 10:23:47,887 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Starting
> execution of job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
> (a1221c743367497b0000000000000002) under job master id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:47,887 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Using
> failover strategy
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@2222ba4d
> for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a1221c743367497b0000000000000002).
> 2023-07-06 10:23:47,880 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Running
> initialization on master for job
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a1221c743367497b0000000000000002).
> 2023-07-06 10:23:47,872 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
> Found 0 checkpoints in
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
> 2023-07-06 10:23:47,867 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
> Recovering checkpoints from
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
> 2023-07-06 10:23:47,832 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Using
> restart back off time strategy
> ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000,
> maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000,
> jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
> (a1221c743367497b0000000000000002).
> 2023-07-06 10:23:47,832 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] -
> Initializing job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
> (a1221c743367497b0000000000000002).
> 2023-07-06 10:23:47,820 INFO
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added
> JobGraph(jobId: a1221c743367497b0000000000000002) to
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
> 2023-07-06 10:23:47,780 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting
> job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
> (a1221c743367497b0000000000000002).
> 2023-07-06 10:23:47,776 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
> JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
> (a1221c743367497b0000000000000002).
> 2023-07-06 10:23:47,668 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Submitting Job with JobId=a1221c743367497b0000000000000002.
> 2023-07-06 10:23:47,668 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Job a1221c743367497b0000000000000002 is submitted.
> Flink Operator
> 2023-07-06 10:23:48,007 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-smc-staging-e5730831] Submitted job:
> a1221c743367497b0000000000000002 to session cluster.
> 2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-smc-staging-e5730831] Submitting job:
> a1221c743367497b0000000000000002 to session cluster.
> 2023-07-06 10:23:45,416 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-smc-staging-e5730831] Submitting job:
> a1221c743367497b0000000000000002 to session cluster.
> {code}
> JobID: e692c2dfaa18441c0000000000000002
> {code:bash}
> Flink Cluster
> 2023-07-06 11:23:48,004 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 1 for job e692c2dfaa18441c0000000000000002 (8194 bytes,
> checkpointDuration=125 ms, finalizationTime=28 ms).
> 2023-07-06 11:23:47,867 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 1 (type=CheckpointType{name='Checkpoint',
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427851 for job
> e692c2dfaa18441c0000000000000002.
> 2023-07-06 10:23:48,568 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Offer
> reserved slots to the leader of job e692c2dfaa18441c0000000000000002.
> 2023-07-06 10:23:48,568 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Establish
> JobManager connection for job e692c2dfaa18441c0000000000000002.
> 2023-07-06 10:23:48,568 INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful
> registration at job manager
> akka.tcp://[email protected]:6123/user/rpc/jobmanager_6 for job
> e692c2dfaa18441c0000000000000002.
> 2023-07-06 10:23:48,002 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 5e5a0e55fac280bf31abf29a20bce684 for job
> e692c2dfaa18441c0000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:48,002 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 1cdbce54f4376a1df86430f97dab6858 for job
> e692c2dfaa18441c0000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:48,002 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request 352db7288d0e4d1775d5f52dd14c769d for job
> e692c2dfaa18441c0000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:48,001 INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job
> e692c2dfaa18441c0000000000000002 for job leader monitoring.
> 2023-07-06 10:23:48,000 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive
> slot request bffed3e4a4c8573049a4119bd7e15f19 for job
> e692c2dfaa18441c0000000000000002 from resource manager with leader id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:48,998 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> e692c2dfaa18441c0000000000000002:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=4}]
> 2023-07-06 10:23:47,998 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager
> [email protected]://[email protected]:6123/user/rpc/jobmanager_6
> for job e692c2dfaa18441c0000000000000002.
> 2023-07-06 10:23:47,953 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager
> [email protected]://[email protected]:6123/user/rpc/jobmanager_6
> for job e692c2dfaa18441c0000000000000002.
> 2023-07-06 10:23:47,851 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615
> (e692c2dfaa18441c0000000000000002) switched from state CREATED to RUNNING.
> 2023-07-06 10:23:47,845 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Starting
> execution of job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
> (e692c2dfaa18441c0000000000000002) under job master id
> aaa9331f70b07a195b5f09d57d1b40c5.
> 2023-07-06 10:23:47,844 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Using
> failover strategy
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@7eeab246
> for aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615
> (e692c2dfaa18441c0000000000000002).
> 2023-07-06 10:23:47,834 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Running
> initialization on master for job
> aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615
> (e692c2dfaa18441c0000000000000002).
> 2023-07-06 10:23:47,825 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
> Found 0 checkpoints in
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
> 2023-07-06 10:23:47,813 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
> Recovering checkpoints from
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
> 2023-07-06 10:23:47,782 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Using
> restart back off time strategy
> ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000,
> maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000,
> jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for
> aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615
> (e692c2dfaa18441c0000000000000002).
> 2023-07-06 10:23:47,781 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] -
> Initializing job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
> (e692c2dfaa18441c0000000000000002).
> 2023-07-06 10:23:47,774 INFO
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added
> JobGraph(jobId: e692c2dfaa18441c0000000000000002) to
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
> 2023-07-06 10:23:47,703 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting
> job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
> (e692c2dfaa18441c0000000000000002).
> 2023-07-06 10:23:47,702 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
> JobGraph submission
> 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
> (e692c2dfaa18441c0000000000000002).
> 2023-07-06 10:23:47,650 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Submitting Job with JobId=e692c2dfaa18441c0000000000000002.
> 2023-07-06 10:23:47,650 INFO
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
> - Job e692c2dfaa18441c0000000000000002 is submitted.
> Flink Operator
> 2023-07-06 10:23:47,973 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitted job:
> e692c2dfaa18441c0000000000000002 to session cluster.
> 2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job:
> e692c2dfaa18441c0000000000000002 to session cluster.
> 2023-07-06 10:23:45,374 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job:
> e692c2dfaa18441c0000000000000002 to session cluster.
> {code}