Frens Jan Rumph created RATIS-2243:
--------------------------------------

             Summary: Non-daemon threads for streams keep distcp tasks open
                 Key: RATIS-2243
                 URL: https://issues.apache.org/jira/browse/RATIS-2243
             Project: Ratis
          Issue Type: Bug
          Components: Streaming
    Affects Versions: 3.1.2
            Reporter: Frens Jan Rumph


When using hadoop distcp to copy data to Ozone (via the ofs:// rooted file 
system), I see logging like this in the application master:
 
...
2025-01-26 19:27:39,543 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from RUNNING to COMMIT_PENDING
2025-01-26 19:27:39,543 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: attempt_1737453940167_12142_m_000002_0 given a go for committing the task output.
2025-01-26 19:27:39,545 INFO [IPC Server handler 7 on default port 39515] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request from attempt_1737453940167_12142_m_000002_0
2025-01-26 19:27:39,546 INFO [IPC Server handler 7 on default port 39515] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Result of canCommit for attempt_1737453940167_12142_m_000002_0:true
2025-01-26 19:27:39,580 INFO [IPC Server handler 9 on default port 39515] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1737453940167_12142_m_000002_0 is : 1.0
2025-01-26 19:27:39,582 INFO [IPC Server handler 14 on default port 39515] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgment from attempt_1737453940167_12142_m_000002_0
2025-01-26 19:27:39,584 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from COMMIT_PENDING to SUCCESS_FINISHING_CONTAINER
2025-01-26 19:27:39,589 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,878 INFO [Ping Checker for TaskAttemptFinishingMonitor] org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:attempt_1737453940167_12142_m_000002_0 Timed out after 60 secs
2025-01-26 19:28:43,878 WARN [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Task attempt attempt_1737453940167_12142_m_000002_0 is done from TaskUmbilicalProtocol's point of view. However, it stays in finishing state for too long
2025-01-26 19:28:43,879 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from SUCCESS_FINISHING_CONTAINER to SUCCESS_CONTAINER_CLEANUP
2025-01-26 19:28:43,880 INFO [ContainerLauncher #3] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e173_1737453940167_12142_01_000004 taskAttempt attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,880 INFO [ContainerLauncher #3] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,916 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2025-01-26 19:28:44,314 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1737453940167_12142_m_000002_0: [2025-01-26 19:28:43.912]Container killed by the ApplicationMaster.
 
 
I think this is related to non-daemon Netty threads like this in the map tasks:
 
"NettyClientStreamRpc-workerGroup--thread5" #84 prio=5 os_prio=0 cpu=86.68ms 
elapsed=86.45s tid=0x00007f4e19609110 nid=0x125710 runnable 
[0x00007f4da7aea000] java.lang.Thread.State: RUNNABLE at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native 
Method) at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native.java:220)
 at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native.java:213)
 at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.epollWaitNoTimerChange(EpollEventLoop.java:308)
 at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:365)
 at 
org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
 at 
org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
 at java.lang.Thread.run([email protected]/Thread.java:840)
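
For context: the JVM only exits once all non-daemon threads have terminated, so a non-daemon event-loop thread that keeps polling will hold the task JVM open even after the map task itself has finished. A minimal stand-alone sketch of the effect (not Ratis code; the thread name is made up to mirror the dump above):

public class NonDaemonThreadDemo {
  public static void main(String[] args) {
    Thread worker = new Thread(() -> {
      // Stands in for Netty's epollWait loop shown in the stack trace above.
      while (true) {
        try {
          Thread.sleep(1_000);
        } catch (InterruptedException e) {
          return;
        }
      }
    }, "fake-workerGroup-thread");
    // worker.setDaemon(true);  // with this line uncommented, the JVM exits right away
    worker.start();
    System.out.println("main() is done, but the JVM stays up because the worker thread is non-daemon");
  }
}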
 
 
I only see this behaviour when using -Dozone.fs.datastream.enabled=true. I 
think this is related to how org.apache.ratis.netty.server.NettyServerStreamRpc 
sets up threads.
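
If that is indeed the cause, one possible direction (only a sketch under my assumptions, not the actual Ratis code; the pool name is copied from the thread dump above, and I'm assuming the shaded Netty bundled with Ratis) would be to hand the event-loop group a daemon thread factory, so the worker threads cannot keep the task JVM alive on their own:

import java.util.concurrent.ThreadFactory;
import org.apache.ratis.thirdparty.io.netty.channel.EventLoopGroup;
import org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoopGroup;
import org.apache.ratis.thirdparty.io.netty.util.concurrent.DefaultThreadFactory;

public class DaemonWorkerGroupSketch {
  public static void main(String[] args) {
    // DefaultThreadFactory(poolName, daemon): every thread it creates is marked as a daemon thread.
    ThreadFactory daemonFactory =
        new DefaultThreadFactory("NettyClientStreamRpc-workerGroup", true);
    // 0 = Netty's default thread count; the dump shows the epoll transport, Nio is used here for brevity.
    EventLoopGroup workerGroup = new NioEventLoopGroup(0, daemonFactory);
    // ... bootstrap the stream RPC with workerGroup as usual ...
    workerGroup.shutdownGracefully();  // an explicit shutdown on close would of course be the cleaner fix
  }
}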
 
This is probably a Ratis issue, so I filed the (possible) bug here. But I 
experienced it when using Ozone from Hadoop, so please redirect me if need be.


