[
https://issues.apache.org/jira/browse/HDDS-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633537#comment-17633537
]
Sadanand Shenoy edited comment on HDDS-6126 at 11/14/22 7:13 AM:
-----------------------------------------------------------------
Some more findings:
After the command is run once, even though the job finishes and the job
counters are displayed, the MRAppMaster is still running. In a good run it
terminates immediately after the job completes. Querying the AM container
status also reports RUNNING although the job has already completed:
{code:java}
[hadoop@nm ~]$ yarn container -status container_1668407558972_0003_01_000001
2022-11-14 06:53:23 INFO RMProxy:98 - Connecting to ResourceManager at
rm/172.18.0.8:8032
Container Report :
Container-Id : container_1668407558972_0003_01_000001
Start-Time : 1668408723283
Finish-Time : 0
State : RUNNING
LOG-URL :
http://nm:8042/node/containerlogs/container_1668407558972_0003_01_000001/hadoop
Host : nm:35245
NodeHttpAddress : http://nm:8042
Diagnostics : null
[hadoop@nm ~]$ jps
1636 MRAppMaster
102 NodeManager
2238 Jps {code}
Took a jstack of the MRAppMaster to check what it is doing, and it shows
runnable threads for NettyClientStreamRpc; could it be that the client has not
closed its stream?
{code:java}
"NettyClientStreamRpc-workerGroup--thread1" #146 prio=5 os_prio=0
tid=0x0000560fc7008000 nid=0x33a runnable [0x00007f33680a3000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000000f026faa8> (a
org.apache.ratis.thirdparty.io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0x00000000f025f840> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000f025f728> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at
org.apache.ratis.thirdparty.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:68)
at
org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:813)
at
org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
at
org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
at
org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748) {code}
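If that hypothesis holds, the fix on the client side would be the usual
close-on-completion pattern. A minimal sketch of that pattern, where
ClientStream is a hypothetical stand-in name and not the actual Ozone/Ratis
stream type (whose real shutdown hooks differ):
{code:java}
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for the stream client; only illustrates the pattern.
final class ClientStream implements Closeable {
    private boolean closed;
    public boolean isClosed() { return closed; }
    public void write(byte[] data) throws IOException {
        if (closed) throw new IOException("stream already closed");
        // ... would hand the bytes to the Netty channel here ...
    }
    @Override public void close() {
        // ... would shut down the workerGroup / event loop here ...
        closed = true;
    }
}

public class StreamCloseSketch {
    // try-with-resources guarantees close() runs on every path, including
    // exception paths, so no worker threads outlive the job.
    public static ClientStream writeAndClose(byte[] payload) throws IOException {
        ClientStream stream = new ClientStream();
        try (stream) { // Java 9+ form: closes the existing final resource
            stream.write(payload);
        }
        return stream;
    }
}
{code}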
On force-killing the MRAppMaster process, the datanodes log 'client
cancelled', which suggests the client connection was indeed still alive:
{code:java}
datanode_3 | 2022-11-14 06:48:43 WARN GrpcClientProtocolService:122 -
7-OrderedRequestStreamObserver7: onError:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
datanode_3 | 2022-11-14 06:48:43 WARN GrpcClientProtocolService:122 -
8-UnorderedRequestStreamObserver8: onError:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
{code}
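A lingering non-daemon worker thread is enough by itself to keep a JVM alive
after main() returns, which would explain the MRAppMaster not exiting. A
self-contained stdlib sketch of that mechanism (no Ratis/Netty involved; the
Selector loop is only a stand-in for the NettyClientStreamRpc worker blocked
in epollWait in the jstack above):
{code:java}
import java.io.IOException;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.Selector;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SelectorWorkerDemo {

    // Start a non-daemon thread blocked in select(), mirroring the RUNNABLE
    // epollWait frame of the NettyClientStreamRpc workerGroup thread.
    public static Thread startWorker(Selector selector, CountDownLatch stopped) {
        Thread worker = new Thread(() -> {
            try {
                while (selector.isOpen()) {
                    selector.select(100); // wakes every 100 ms to recheck
                }
            } catch (IOException | ClosedSelectorException ignored) {
                // selector was closed underneath us; just exit the loop
            } finally {
                stopped.countDown();
            }
        }, "demo-workerGroup-thread");
        // Netty event-loop threads are non-daemon, so an unclosed client
        // keeps the whole JVM (here, the MRAppMaster) from exiting.
        worker.setDaemon(false);
        worker.start();
        return worker;
    }

    // Closing the selector plays the role of client.close(): only after it
    // does the worker thread terminate and let the JVM exit.
    public static boolean closeAndJoin(Selector selector, Thread worker,
                                       CountDownLatch stopped) throws Exception {
        selector.close();
        boolean loopExited = stopped.await(5, TimeUnit.SECONDS);
        worker.join(5000);
        return loopExited && !worker.isAlive();
    }

    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        CountDownLatch stopped = new CountDownLatch(1);
        Thread worker = startWorker(selector, stopped);
        System.out.println("worker alive before close: " + worker.isAlive());
        System.out.println("worker terminated after close: "
            + closeAndJoin(selector, worker, stopped));
    }
}
{code}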
> MR Acceptance test failure on Streaming Branch (HDDS-4454)
> ----------------------------------------------------------
>
> Key: HDDS-6126
> URL: https://issues.apache.org/jira/browse/HDDS-6126
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Sadanand Shenoy
> Assignee: Sadanand Shenoy
> Priority: Major
> Attachments: NM-jstack.txt, nm_logs.log
>
>
> The Ozone MR acceptance test fails on the streaming branch if
> ozone.fs.datastream.enable is set to true.
> The test runs a YARN wordcount job, and the observation here is that the job
> does not release the ApplicationMaster (AM container) as soon as the job is
> done and the job counters are displayed, which is why the second job waits a
> long time (~10 min) for the AM container to be released.
> Verified with streaming enabled and with streaming disabled that the problem
> occurs only when ozone.fs.datastream.enable=true:
> {code:java}
> yarn jar
> /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi -D
> fs.defaultFS=o3fs://bucket1.volume1.om/ -D ozone.fs.datastream.enable=true
> 3 3{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)