Frens Jan Rumph created RATIS-2243:
--------------------------------------
Summary: Non-daemon threads for streams keep distcp tasks open
Key: RATIS-2243
URL: https://issues.apache.org/jira/browse/RATIS-2243
Project: Ratis
Issue Type: Bug
Components: Streaming
Affects Versions: 3.1.2
Reporter: Frens Jan Rumph
When using Hadoop distcp to copy data to Ozone (via an ofs:// rooted file
system), I see logging like this in the application master:
...
2025-01-26 19:27:39,543 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from RUNNING to COMMIT_PENDING
2025-01-26 19:27:39,543 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: attempt_1737453940167_12142_m_000002_0 given a go for committing the task output.
2025-01-26 19:27:39,545 INFO [IPC Server handler 7 on default port 39515] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request from attempt_1737453940167_12142_m_000002_0
2025-01-26 19:27:39,546 INFO [IPC Server handler 7 on default port 39515] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Result of canCommit for attempt_1737453940167_12142_m_000002_0:true
2025-01-26 19:27:39,580 INFO [IPC Server handler 9 on default port 39515] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1737453940167_12142_m_000002_0 is : 1.0
2025-01-26 19:27:39,582 INFO [IPC Server handler 14 on default port 39515] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgment from attempt_1737453940167_12142_m_000002_0
2025-01-26 19:27:39,584 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from COMMIT_PENDING to SUCCESS_FINISHING_CONTAINER
2025-01-26 19:27:39,589 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,878 INFO [Ping Checker for TaskAttemptFinishingMonitor] org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:attempt_1737453940167_12142_m_000002_0 Timed out after 60 secs
2025-01-26 19:28:43,878 WARN [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Task attempt attempt_1737453940167_12142_m_000002_0 is done from TaskUmbilicalProtocol's point of view. However, it stays in finishing state for too long
2025-01-26 19:28:43,879 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from SUCCESS_FINISHING_CONTAINER to SUCCESS_CONTAINER_CLEANUP
2025-01-26 19:28:43,880 INFO [ContainerLauncher #3] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e173_1737453940167_12142_01_000004 taskAttempt attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,880 INFO [ContainerLauncher #3] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,916 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2025-01-26 19:28:44,314 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1737453940167_12142_m_000002_0: [2025-01-26 19:28:43.912]Container killed by the ApplicationMaster.
I think this is related to non-daemon Netty threads like this in the map tasks:
"NettyClientStreamRpc-workerGroup--thread5" #84 prio=5 os_prio=0 cpu=86.68ms
elapsed=86.45s tid=0x00007f4e19609110 nid=0x125710 runnable
[0x00007f4da7aea000] java.lang.Thread.State: RUNNABLE at
org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native
Method) at
org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native.java:220)
at
org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native.java:213)
at
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.epollWaitNoTimerChange(EpollEventLoop.java:308)
at
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:365)
at
org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at
org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run([email protected]/Thread.java:840)
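For context on why this keeps the container open: the JVM only exits once every non-daemon thread has terminated, so a single idle worker thread like the one above is enough to keep the map task process alive until YARN kills it. A minimal, self-contained sketch of that mechanism (not Ratis code; the class and thread names are made up for illustration):

// Illustrates why a leftover non-daemon thread keeps a JVM process alive
// after the task's main work has finished.
public class NonDaemonThreadDemo {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE); // mimics an idle epollWait loop
            } catch (InterruptedException ignored) {
                // exit when interrupted
            }
        }, "stream-rpc-workerGroup-thread");

        // worker.setDaemon(true); // with this line the process would exit normally
        worker.start();

        System.out.println("main() is done, but the process keeps running");
        // The JVM only shuts down once all non-daemon threads have finished,
        // so this process lingers until something kills it.
    }
}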
I only see this behaviour when using -Dozone.fs.datastream.enabled=true. I
think this is related to how org.apache.ratis.netty.server.NettyServerStreamRpc
sets up threads.
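If the thread setup is indeed the cause, one common Netty pattern is to pass a daemon ThreadFactory when constructing the event loop groups, so the worker threads no longer block JVM shutdown. The sketch below is illustrative only; it uses the plain io.netty classes rather than the shaded org.apache.ratis.thirdparty packages, and the pool name is just copied from the thread dump, so this is not a proposal for the actual fix:

import java.util.concurrent.ThreadFactory;

import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.DefaultThreadFactory;

public class DaemonEventLoopSketch {
    public static void main(String[] args) {
        // DefaultThreadFactory(poolName, daemon): with daemon = true the worker
        // threads do not keep the JVM alive once the main thread exits.
        ThreadFactory daemonFactory =
                new DefaultThreadFactory("NettyClientStreamRpc-workerGroup", true);

        // 0 threads = Netty's default sizing; the factory controls the daemon flag.
        NioEventLoopGroup workerGroup = new NioEventLoopGroup(0, daemonFactory);

        // ... use workerGroup for the streaming channels ...

        // Shutting the group down explicitly on close() is still the cleaner fix;
        // the daemon flag only acts as a safety net for paths that miss it.
        workerGroup.shutdownGracefully();
    }
}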
This is probably a Ratis issue, so I filed the (possible) bug here. But I
ran into it when using Ozone on Hadoop, so please redirect me if need be.