[
https://issues.apache.org/jira/browse/FLINK-34227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825654#comment-17825654
]
Matthias Pohl edited comment on FLINK-34227 at 3/12/24 1:09 PM:
----------------------------------------------------------------
The [findings of my initial
analysis|https://issues.apache.org/jira/browse/FLINK-34227?focusedCommentId=17810745&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17810745]
were incorrect: the log message I reported as missing does exist. The
"{{Close ResourceManager connection [...]}}" message is actually logged twice,
once from the JobMaster's IO thread ({{jobmanager-io-thread-3}}) and once from
the Dispatcher's main thread ({{flink-pekko.actor.default-dispatcher-5}}). The
second call seems to retrigger the reconnection: it is immediately followed by
"{{Connecting to ResourceManager [...]}}" on the same thread.
{code}
[...]
02:51:28,193 [flink-pekko.actor.default-dispatcher-10] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job e7cb13faaae707768a1a4db28427af80 from job leader monitoring.
02:51:28,193 [flink-pekko.actor.default-dispatcher-10] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close JobManager connection for job e7cb13faaae707768a1a4db28427af80.
02:51:28,193 [flink-pekko.actor.default-dispatcher-8] INFO org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer [] - Freeing slot 98a0c702ce550d2fd7dd3710ec7b76e0.
02:51:28,194 [flink-pekko.actor.default-dispatcher-8] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Disconnect TaskExecutor d71ee9b8-f278-48ee-bb1c-f05fd568947f because: TaskExecutor pekko://flink/user/rpc/taskmanager_0 has no more allocated slots for job e7cb13faaae707768a1a4db28427af80.
02:51:28,194 [jobmanager-io-thread-3] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection 3c08958c5ef3906fae847097373b047a: Stopping JobMaster for job 'Flink Streaming Job' (e7cb13faaae707768a1a4db28427af80).
02:51:28,194 [flink-pekko.actor.default-dispatcher-5] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager a38b8b4ba6c4894c7cfca5f1c0fe4f68@pekko://flink/user/rpc/jobmanager_70 for job e7cb13faaae707768a1a4db28427af80 from the resource manager.
02:51:28,194 [flink-pekko.actor.default-dispatcher-5] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection 3c08958c5ef3906fae847097373b047a: Stopping JobMaster for job 'Flink Streaming Job' (e7cb13faaae707768a1a4db28427af80).
02:51:28,194 [flink-pekko.actor.default-dispatcher-5] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Connecting to ResourceManager pekko://flink/user/rpc/resourcemanager_2(86dfd2ebd79836698df3e4a5de474282)
02:51:28,194 [flink-pekko.actor.default-dispatcher-5] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Resolved ResourceManager address, beginning registration
02:51:28,194 [flink-pekko.actor.default-dispatcher-5] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager a38b8b4ba6c4894c7cfca5f1c0fe4f68@pekko://flink/user/rpc/jobmanager_70 for job e7cb13faaae707768a1a4db28427af80.
02:51:28,195 [flink-pekko.actor.default-dispatcher-5] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager a38b8b4ba6c4894c7cfca5f1c0fe4f68@pekko://flink/user/rpc/jobmanager_70 for job e7cb13faaae707768a1a4db28427af80.
02:51:28,195 [flink-pekko.actor.default-dispatcher-5] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - JobManager successfully registered at ResourceManager, leader id: 86dfd2ebd79836698df3e4a5de474282.
[...]
{code}
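A duplicate like this is easy to miss by eye, so here is a minimal sketch (not part of Flink) of how one might scan a log for messages that the same logger emits from more than one thread. The record layout is an assumption inferred from the excerpt above, and the names {{RECORD}}, {{repeated_across_threads}} and {{SAMPLE}} are hypothetical. {{SAMPLE}} is a shortened copy of three records from the excerpt.

```python
import re
from collections import defaultdict

# Assumed record layout, inferred from the excerpt above:
#   HH:MM:SS,mmm [thread] LEVEL logger [] - message
# A message runs until the next timestamp or the end of the input.
RECORD = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+)\s+\[\]\s+-\s+"
    r"(?P<msg>.*?)"
    r"(?=\s\d{2}:\d{2}:\d{2},\d{3}\s\[|$)"
)


def repeated_across_threads(log_text):
    """Return {(logger, message): [(timestamp, thread), ...]} for every
    message that was logged from at least two different threads."""
    # Collapse all whitespace so records wrapped over several lines
    # (as in a mail archive) become one continuous string.
    flat = " ".join(log_text.split())
    seen = defaultdict(list)
    for m in RECORD.finditer(flat):
        seen[(m["logger"], m["msg"])].append((m["ts"], m["thread"]))
    return {
        key: hits
        for key, hits in seen.items()
        if len({thread for _, thread in hits}) > 1
    }


# Shortened copy of three records from the excerpt above.
SAMPLE = """
02:51:28,194 [jobmanager-io-thread-3] INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Close
ResourceManager connection 3c08958c5ef3906fae847097373b047a: Stopping JobMaster
for job 'Flink Streaming Job' (e7cb13faaae707768a1a4db28427af80).
02:51:28,194 [flink-pekko.actor.default-dispatcher-5] INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Close
ResourceManager connection 3c08958c5ef3906fae847097373b047a: Stopping JobMaster
for job 'Flink Streaming Job' (e7cb13faaae707768a1a4db28427af80).
02:51:28,194 [flink-pekko.actor.default-dispatcher-5] INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Connecting to
ResourceManager
pekko://flink/user/rpc/resourcemanager_2(86dfd2ebd79836698df3e4a5de474282)
"""

if __name__ == "__main__":
    for (logger, msg), hits in repeated_across_threads(SAMPLE).items():
        print(logger, "-", msg)
        for ts, thread in hits:
            print("   ", ts, thread)
```

Grouping on the exact message text is a deliberate simplification: it only catches repeats whose wording, including IDs, is identical, which is the case for the two "{{Close ResourceManager connection [...]}}" records here.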
> Job doesn't disconnect from ResourceManager
> -------------------------------------------
>
> Key: FLINK-34227
> URL: https://issues.apache.org/jira/browse/FLINK-34227
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0, 1.18.1
> Reporter: Matthias Pohl
> Assignee: Matthias Pohl
> Priority: Critical
> Labels: github-actions, test-stability
> Attachments: FLINK-34227.7e7d69daebb438b8d03b7392c9c55115.log,
> FLINK-34227.log
>
>
> https://github.com/XComp/flink/actions/runs/7634987973/job/20800205972#step:10:14557
> {code}
> [...]
> "main" #1 prio=5 os_prio=0 tid=0x00007fcccc4b7000 nid=0x24ec0 waiting on condition [0x00007fccce1eb000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000bdd52618> (a java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2131)
> at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2099)
> at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2077)
> at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:876)
> at org.apache.flink.table.planner.runtime.stream.sql.WindowDistinctAggregateITCase.testHopWindow_Cube(WindowDistinctAggregateITCase.scala:550)
> [...]
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)