[
https://issues.apache.org/jira/browse/HDDS-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633537#comment-17633537
]
Sadanand Shenoy edited comment on HDDS-6126 at 11/14/22 7:13 AM:
-----------------------------------------------------------------
Some more findings:
After the command is run once, even though the job finishes and the job
counters are displayed, the MRAppMaster is still running. In a good run it
terminates immediately after the job completes. Querying the AM container
status also reports RUNNING although the job has already completed:
{code:java}
[hadoop@nm ~]$ yarn container -status container_1668407558972_0003_01_000001
2022-11-14 06:53:23 INFO RMProxy:98 - Connecting to ResourceManager at
rm/172.18.0.8:8032
Container Report :
Container-Id : container_1668407558972_0003_01_000001
Start-Time : 1668408723283
Finish-Time : 0
State : RUNNING
LOG-URL :
http://nm:8042/node/containerlogs/container_1668407558972_0003_01_000001/hadoop
Host : nm:35245
NodeHttpAddress : http://nm:8042
Diagnostics : null
[hadoop@nm ~]$ jps
1636 MRAppMaster
102 NodeManager
2238 Jps {code}
Took a jstack of the MRAppMaster to check what it is doing, and it shows
runnable threads for NettyClientStreamRpc; could it be that the client has not
closed its stream?
{code:java}
"NettyClientStreamRpc-workerGroup--thread1" #146 prio=5 os_prio=0
tid=0x0000560fc7008000 nid=0x33a runnable [0x00007f33680a3000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000000f026faa8> (a
org.apache.ratis.thirdparty.io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0x00000000f025f840> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000f025f728> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at
org.apache.ratis.thirdparty.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:68)
at
org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:813)
at
org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
at
org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
at
org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748) {code}
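If that hypothesis holds, the fix on the client side would be the usual
close-on-completion pattern. A minimal sketch of that pattern, where
ClientStream is a hypothetical stand-in name and not the actual Ozone/Ratis
stream type (whose real shutdown hooks differ):
{code:java}
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for the stream client; only illustrates the pattern.
final class ClientStream implements Closeable {
    private boolean closed;
    public boolean isClosed() { return closed; }
    public void write(byte[] data) throws IOException {
        if (closed) throw new IOException("stream already closed");
        // ... would hand the bytes to the Netty channel here ...
    }
    @Override public void close() {
        // ... would shut down the workerGroup / event loop here ...
        closed = true;
    }
}

public class StreamCloseSketch {
    // try-with-resources guarantees close() runs on every path, including
    // exception paths, so no worker threads outlive the job.
    public static ClientStream writeAndClose(byte[] payload) throws IOException {
        ClientStream stream = new ClientStream();
        try (stream) { // Java 9+ form: closes the existing final resource
            stream.write(payload);
        }
        return stream;
    }
}
{code}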
On force-killing the MRAppMaster process, the datanodes log 'client
cancelled', which suggests the client connection was indeed still alive:
{code:java}
datanode_3 | 2022-11-14 06:48:43 WARN GrpcClientProtocolService:122 -
7-OrderedRequestStreamObserver7: onError:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
datanode_3 | 2022-11-14 06:48:43 WARN GrpcClientProtocolService:122 -
8-UnorderedRequestStreamObserver8: onError:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
{code}
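A lingering non-daemon worker thread is enough by itself to keep a JVM alive
after main() returns, which would explain the MRAppMaster not exiting. A
self-contained stdlib sketch of that mechanism (no Ratis/Netty involved; the
Selector loop is only a stand-in for the NettyClientStreamRpc worker blocked
in epollWait in the jstack above):
{code:java}
import java.io.IOException;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.Selector;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SelectorWorkerDemo {

    // Start a non-daemon thread blocked in select(), mirroring the RUNNABLE
    // epollWait frame of the NettyClientStreamRpc workerGroup thread.
    public static Thread startWorker(Selector selector, CountDownLatch stopped) {
        Thread worker = new Thread(() -> {
            try {
                while (selector.isOpen()) {
                    selector.select(100); // wakes every 100 ms to recheck
                }
            } catch (IOException | ClosedSelectorException ignored) {
                // selector was closed underneath us; just exit the loop
            } finally {
                stopped.countDown();
            }
        }, "demo-workerGroup-thread");
        // Netty event-loop threads are non-daemon, so an unclosed client
        // keeps the whole JVM (here, the MRAppMaster) from exiting.
        worker.setDaemon(false);
        worker.start();
        return worker;
    }

    // Closing the selector plays the role of client.close(): only after it
    // does the worker thread terminate and let the JVM exit.
    public static boolean closeAndJoin(Selector selector, Thread worker,
                                       CountDownLatch stopped) throws Exception {
        selector.close();
        boolean loopExited = stopped.await(5, TimeUnit.SECONDS);
        worker.join(5000);
        return loopExited && !worker.isAlive();
    }

    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        CountDownLatch stopped = new CountDownLatch(1);
        Thread worker = startWorker(selector, stopped);
        System.out.println("worker alive before close: " + worker.isAlive());
        System.out.println("worker terminated after close: "
            + closeAndJoin(selector, worker, stopped));
    }
}
{code}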
> MR Acceptance test failure on Streaming Branch (HDDS-4454)
> ----------------------------------------------------------
>
> Key: HDDS-6126
> URL: https://issues.apache.org/jira/browse/HDDS-6126
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Sadanand Shenoy
> Assignee: Sadanand Shenoy
> Priority: Major
> Attachments: NM-jstack.txt, nm_logs.log
>
>
> The Ozone MR acceptance test fails on the streaming branch if
> ozone.fs.datastream.enable is set to true.
> The test runs a YARN wordcount job, and the observation here is that the job
> does not release the ApplicationMaster (AM container) as soon as the job is
> done and the job counters are displayed, which is why the second job waits a
> long time (~10 min) for the AM container to be released.
> Verified with streaming enabled and with streaming disabled that the problem
> occurs only when ozone.fs.datastream.enable=true:
> {code:java}
> yarn jar
> /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi -D
> fs.defaultFS=o3fs://bucket1.volume1.om/ -D ozone.fs.datastream.enable=true
> 3 3{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)