Hi all,

When using hadoop distcp to copy data to Ozone (via the ofs:// rooted file system), I see logging like this in the application master:
...
2025-01-26 19:27:39,543 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from RUNNING to
COMMIT_PENDING
2025-01-26 19:27:39,543 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
attempt_1737453940167_12142_m_000002_0 given a go for committing the task
output.
2025-01-26 19:27:39,545 INFO [IPC Server handler 7 on default port 39515]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request from
attempt_1737453940167_12142_m_000002_0
2025-01-26 19:27:39,546 INFO [IPC Server handler 7 on default port 39515]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Result of canCommit for
attempt_1737453940167_12142_m_000002_0:true
2025-01-26 19:27:39,580 INFO [IPC Server handler 9 on default port 39515]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1737453940167_12142_m_000002_0 is : 1.0
2025-01-26 19:27:39,582 INFO [IPC Server handler 14 on default port 39515]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgment from
attempt_1737453940167_12142_m_000002_0
2025-01-26 19:27:39,584 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from
COMMIT_PENDING to SUCCESS_FINISHING_CONTAINER
2025-01-26 19:27:39,589 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
attempt attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,878 INFO [Ping Checker for TaskAttemptFinishingMonitor]
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor:
Expired:attempt_1737453940167_12142_m_000002_0 Timed out after 60 secs
2025-01-26 19:28:43,878 WARN [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Task attempt
attempt_1737453940167_12142_m_000002_0 is done from TaskUmbilicalProtocol's
point of view. However, it stays in finishing state for too long
2025-01-26 19:28:43,879 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from
SUCCESS_FINISHING_CONTAINER to SUCCESS_CONTAINER_CLEANUP
2025-01-26 19:28:43,880 INFO [ContainerLauncher #3]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing
the event EventType: CONTAINER_REMOTE_CLEANUP for container
container_e173_1737453940167_12142_01_000004 taskAttempt
attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,880 INFO [ContainerLauncher #3]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING
attempt_1737453940167_12142_m_000002_0
2025-01-26 19:28:43,916 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1737453940167_12142_m_000002_0 TaskAttempt Transitioned from
SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2025-01-26 19:28:44,314 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report
from attempt_1737453940167_12142_m_000002_0: [2025-01-26 19:28:43.912]Container
killed by the ApplicationMaster.
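As a side note on the 60 second expiry above: if I read the AM behaviour correctly, this is the finishing-state timeout (mapreduce.task.exit.timeout, default 60000 ms, "how long before a task is terminated if it stays in finishing state"). Raising it only postpones the container kill rather than fixing anything; the property name and placeholder paths below are my own assumption, not taken from the job shown above:

```shell
# Assumption on my part: the "Timed out after 60 secs" is governed by
# mapreduce.task.exit.timeout. Raising it only delays the kill; the
# non-daemon threads discussed below still linger. Paths are placeholders.
hadoop distcp -Dmapreduce.task.exit.timeout=120000 \
    hdfs://source/path ofs://ozone-service/vol/bucket/path
```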
I suspect this is caused by non-daemon Netty threads in the map tasks, like this one:
"NettyClientStreamRpc-workerGroup--thread5" #84 prio=5 os_prio=0 cpu=86.68ms
elapsed=86.45s tid=0x00007f4e19609110 nid=0x125710 runnable
[0x00007f4da7aea000]
java.lang.Thread.State: RUNNABLE
	at org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native Method)
	at org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native.java:220)
	at org.apache.ratis.thirdparty.io.netty.channel.epoll.Native.epollWait(Native.java:213)
	at org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.epollWaitNoTimerChange(EpollEventLoop.java:308)
	at org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:365)
	at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run([email protected]/Thread.java:840)
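Why non-daemon matters here: the JVM only exits once every non-daemon thread has terminated, so a worker-group thread parked in epollWait keeps the mapper process alive after the task has finished, until the AM kills the container. A minimal sketch (my own illustration, not Ratis code) of the daemon/non-daemon distinction:

```java
// Sketch only: Java threads are non-daemon by default, and the JVM waits
// for all non-daemon threads before exiting. A long-lived worker thread
// therefore keeps the process alive after main() returns -- unless it is
// marked daemon before start().
public class DaemonDemo {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000); // stand-in for a blocking epollWait
            } catch (InterruptedException ignored) {
            }
        }, "workerGroup-thread");

        System.out.println("non-daemon by default: " + !worker.isDaemon());

        // With this flag set, the JVM exits as soon as main() returns,
        // instead of waiting out the 60 s sleep above.
        worker.setDaemon(true);
        worker.start();
        System.out.println("daemon now: " + worker.isDaemon());
    }
}
```

If the Ratis stream RPC created its worker groups from a daemon thread factory, the mapper JVM could exit normally on its own.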
I only see this behaviour when using -Dozone.fs.datastream.enabled=true. I
think this is related to how org.apache.ratis.netty.server.NettyServerStreamRpc
sets up threads.
This is probably a Ratis issue, but I ran into it when using Ozone on Hadoop, hence this mail to both developer mailing lists.
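For anyone hitting the same thing in the meantime: since the behaviour only shows up with the datastream path enabled, running distcp without it avoids the lingering worker threads (paths below are placeholders, not my actual invocation):

```shell
# Workaround sketch: leave the datastream write path disabled (or set the
# flag to false explicitly). The lingering Netty workerGroup threads do not
# appear without it; write throughput may differ.
hadoop distcp -Dozone.fs.datastream.enabled=false \
    hdfs://source/path ofs://ozone-service/vol/bucket/path
```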
Best regards,
Frens Jan
Frens Jan Rumph
Data platform engineering lead
phone: +31 50 21 11 622
site: web-iq.com
pgp: CEE2 A4F1 972E 78C0 F816 86BB D096 18E2 3AC0 16E0
