Hi all,

When using hadoop distcp to copy data to Ozone (using the ofs:// rooted file 
system), I see logging like this in the application master:

...
2025-01-26 19:27:39,543 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from RUNNING to 
COMMIT_PENDING
2025-01-26 19:27:39,543 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: 
attempt_1737453940167_12142_m_000002_0 given a go for committing the task 
output.
2025-01-26 19:27:39,545 INFO [IPC Server handler 7 on default port 39515] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request from 
attempt_1737453940167_12142_m_000002_0
2025-01-26 19:27:39,546 INFO [IPC Server handler 7 on default port 39515] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Result of canCommit for 
attempt_1737453940167_12142_m_000002_0:true
2025-01-26 19:27:39,580 INFO [IPC Server handler 9 on default port 39515] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
attempt_1737453940167_12142_m_000002_0 is : 1.0
2025-01-26 19:27:39,582 INFO [IPC Server handler 14 on default port 39515] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgment from 
attempt_1737453940167_12142_m_000002_0
2025-01-26 19:27:39,584 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from 
COMMIT_PENDING to SUCCESS_FINISHING_CONTAINER
2025-01-26 19:27:39,589 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with 
attempt attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,878 INFO [Ping Checker for TaskAttemptFinishingMonitor] 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
Expired:attempt_1737453940167_12142_m_000002_0 Timed out after 60 secs
2025-01-26 19:28:43,878 WARN [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Task attempt 
attempt_1737453940167_12142_m_000002_0 is done from TaskUmbilicalProtocol's 
point of view. However, it stays in finishing state for too long
2025-01-26 19:28:43,879 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from 
SUCCESS_FINISHING_CONTAINER to SUCCESS_CONTAINER_CLEANUP
2025-01-26 19:28:43,880 INFO [ContainerLauncher #3] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
the event EventType: CONTAINER_REMOTE_CLEANUP for container 
container_e173_1737453940167_12142_01_000004 taskAttempt 
attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,880 INFO [ContainerLauncher #3] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING 
attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,916 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from 
SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2025-01-26 19:28:44,314 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1737453940167_12142_m_000002_0: [2025-01-26 19:28:43.912]Container 
killed by the ApplicationMaster.


I think this is caused by non-daemon Netty threads like the one below in the map 
tasks, which keep the task JVM alive after the task has finished:

"NettyClientStreamRpc-workerGroup--thread5" #84 prio=5 os_prio=0 cpu=86.68ms 
elapsed=86.45s tid=0x00007f4e19609110 nid=0x125710 runnable  
[0x00007f4da7aea000]
   java.lang.Thread.State: RUNNABLE
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native 
Method)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native.java:220)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native.java:213)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.epollWaitNoTimerChange(EpollEventLoop.java:308)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:365)
        at 
org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at 
org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run([email protected]/Thread.java:840)


I only see this behaviour with -Dozone.fs.datastream.enabled=true. I suspect it 
is related to how org.apache.ratis.netty.server.NettyServerStreamRpc sets up 
its threads.
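For context on why such leftover threads cause the timeout: the JVM only exits once all non-daemon threads have terminated, so a Netty event loop thread that is created as non-daemon and never shut down keeps the task process alive until the AM's finishing monitor expires it and kills the container. A minimal, hypothetical sketch of the difference (plain JDK, not the actual Ratis/Netty code):

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class DaemonThreadDemo {
    /**
     * Thread factory that marks every thread it creates as daemon, so
     * lingering worker threads no longer block JVM exit. Purely an
     * illustration; not the factory Ratis actually uses.
     */
    static class DaemonThreadFactory implements ThreadFactory {
        private final AtomicInteger counter = new AtomicInteger();
        private final String prefix;

        DaemonThreadFactory(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true); // daemon threads do not keep the JVM alive
            return t;
        }
    }

    public static void main(String[] args) {
        Runnable idle = () -> {
            try {
                Thread.sleep(Long.MAX_VALUE); // simulate an event loop that is never shut down
            } catch (InterruptedException ignored) {
            }
        };

        // Threads are non-daemon by default; if this one were started and
        // never shut down, main() returning would NOT end the process --
        // the situation the map tasks appear to be in.
        Thread blocking = new Thread(idle, "blocking-worker");
        System.out.println("default thread is daemon: " + blocking.isDaemon());

        // A daemon worker from the factory above does not block exit:
        Thread daemon = new DaemonThreadFactory("workerGroup").newThread(idle);
        daemon.start();
        System.out.println("factory thread is daemon: " + daemon.isDaemon());

        // main() returns here; only the daemon thread is still running,
        // so the JVM terminates instead of hanging until it is killed.
    }
}
```

For what it's worth, Netty's io.netty.util.concurrent.DefaultThreadFactory takes a daemon flag in its constructor; whether Ratis can (or should) pass daemon=true for these stream RPC event loop groups is something I'd leave to the Ratis folks.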

This is probably a Ratis issue, but since I ran into it when using Ozone from 
Hadoop, I'm mailing both developer lists.

Best regards,
Frens Jan



Frens Jan Rumph
Data platform engineering lead
Award-winning OSINT partner for Law Enforcement and Defence.
phone: +31 50 21 11 622
web-iq.com <https://web-iq.com/>
pgp: CEE2 A4F1 972E 78C0 F816 86BB D096 18E2 3AC0 16E0