zentol commented on a change in pull request #18678:
URL: https://github.com/apache/flink/pull/18678#discussion_r803465610
##########
File path:
flink-yarn-tests/src/test/java/org/apache/flink/yarn/YarnTestBase.java
##########
@@ -273,6 +274,8 @@ public void setupYarnClient() {
}
flinkConfiguration = new org.apache.flink.configuration.Configuration(globalConfiguration);
+ flinkConfiguration.setString(RestOptions.ADDRESS.key(), "0.0.0.0");
+ flinkConfiguration.setString(RestOptions.BIND_ADDRESS.key(), "0.0.0.0");
Review comment:
I think we should look into this further.
It is not clear to me how changing the bind address fixes this. The TM can
clearly register at the JM, so basic connectivity does not look like the problem.
The only trace I found in the JM logs is this sequence:
```
INFO o.a.f.yarn.YarnResourceManagerDriver [] - TaskExecutor
container_1644315356936_0001_01_000002(083c75858fb9:33809) will be started on
083c75858fb9 with TaskExecutorProcessSpec {cpuCores=2.0,
frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb
(134217728 bytes), taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0
bytes, networkMemSize=64.000mb (67108864 bytes), managedMemorySize=230.400mb
(241591914 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes),
jvmOverheadSize=192.000mb (201326592 bytes), numSlots=2}.
INFO o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] -
Requested worker container_1644315356936_0001_01_000002(083c75858fb9:33809)
with resource spec WorkerResourceSpec {cpuCores=2.0, taskHeapSize=25.600mb
(26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864
bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=2}.
INFO org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] -
Processing Event EventType: START_CONTAINER for Container
container_1644315356936_0001_01_000002
INFO o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] -
Registering TaskManager with ResourceID
container_1644315356936_0001_01_000002(083c75858fb9:33809)
(akka.tcp://flink@083c75858fb9:33093/user/rpc/taskmanager_0) at ResourceManager
INFO o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Worker
container_1644315356936_0001_01_000002(083c75858fb9:33809) is registered.
INFO o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Worker
container_1644315356936_0001_01_000002(083c75858fb9:33809) with resource spec
WorkerResourceSpec {cpuCores=2.0, taskHeapSize=25.600mb (26843542 bytes),
taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes),
managedMemSize=230.400mb (241591914 bytes), numSlots=2} was requested in
current attempt. Current pending count after registering: 0.
INFO o.a.f.runtime.dispatcher.MiniDispatcher [] - Job
f0690a993ed145c6ebb640d0682b2885 reached terminal state FINISHED.
INFO o.a.f.runtime.jobmaster.JobMaster [] - Stopping the
JobMaster for job 'Flink Java Job at Tue Feb 08 10:15:59 UTC 2022'
(f0690a993ed145c6ebb640d0682b2885).
INFO o.a.f.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] -
Releasing slot [064b6440e89800203d7fc93383869f26].
INFO o.a.f.runtime.jobmaster.JobMaster [] - Close
ResourceManager connection 378fc312596e57c7dcdd9cbee98ec674: Stopping JobMaster
for job 'Flink Java Job at Tue Feb 08 10:15:59 UTC 2022'
(f0690a993ed145c6ebb640d0682b2885).
INFO o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] -
Disconnect job manager
[email protected]://flink@083c75858fb9:42211/user/rpc/jobmanager_1
for job f0690a993ed145c6ebb640d0682b2885 from the resource manager.
INFO o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] -
Stopping worker container_1644315356936_0001_01_000002(083c75858fb9:33809).
INFO o.a.f.yarn.YarnResourceManagerDriver [] - Stopping
container container_1644315356936_0001_01_000002(083c75858fb9:33809).
INFO o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] -
Closing TaskExecutor connection
container_1644315356936_0001_01_000002(083c75858fb9:33809) because:
TaskExecutor exceeded the idle timeout.
INFO org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] -
Processing Event EventType: STOP_CONTAINER for Container
container_1644315356936_0001_01_000002
2022-02-08 10:17:06,411 WARN
o.a.f.runtime.resourcemanager.active.ActiveResourceManager [] - Discard
registration from TaskExecutor
container_1644315356936_0001_01_000002(083c75858fb9:33809) at
(akka.tcp://flink@083c75858fb9:33093/user/rpc/taskmanager_0) because the
framework did not recognize it
```
This looks like the TM trying to re-register with the JM after its shutdown had
already been requested.
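To make the suspected sequence concrete, here is a minimal sketch of the bookkeeping I have in mind. It is a toy model, not Flink's actual `ActiveResourceManager` code: the class `WorkerRegistrySketch` and all of its methods are hypothetical names, and the only assumption is that the resource manager forgets a worker once it has been stopped and therefore rejects any later registration from it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical, not Flink code) of worker tracking in a resource manager. */
final class WorkerRegistrySketch {

    enum State { REQUESTED, REGISTERED }

    enum Outcome { REGISTERED, DISCARDED }

    // Workers the resource manager currently knows about, keyed by container id.
    private final Map<String, State> workers = new HashMap<>();

    /** A container was requested; the worker is now known but not yet registered. */
    void requestWorker(String containerId) {
        workers.put(containerId, State.REQUESTED);
    }

    /** A TaskExecutor tries to register; unknown workers are discarded. */
    Outcome register(String containerId) {
        if (!workers.containsKey(containerId)) {
            return Outcome.DISCARDED;
        }
        workers.put(containerId, State.REGISTERED);
        return Outcome.REGISTERED;
    }

    /** The worker is stopped (e.g. idle timeout) and forgotten. */
    void stopWorker(String containerId) {
        workers.remove(containerId);
    }

    /** Replays the order of events seen in the JM log above. */
    static List<Outcome> replayLoggedSequence() {
        WorkerRegistrySketch rm = new WorkerRegistrySketch();
        String id = "container_1644315356936_0001_01_000002";
        List<Outcome> outcomes = new ArrayList<>();

        rm.requestWorker(id);
        outcomes.add(rm.register(id)); // first registration is accepted
        rm.stopWorker(id);             // job finished, worker stopped
        outcomes.add(rm.register(id)); // late re-registration is discarded
        return outcomes;
    }
}
```

In this model, `replayLoggedSequence()` accepts the first registration and discards the second, which matches the final WARN line. If that is what happens in the test, the bind address would not be involved at all.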