[jira] [Commented] (HDDS-6126) MR Acceptance test failure on Streaming Branch (HDDS-4454)

Sadanand Shenoy (Jira) Tue, 01 Nov 2022 08:48:05 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627219#comment-17627219
 ]


Sadanand Shenoy commented on HDDS-6126:
---------------------------------------

NM log
{code:java}
2022-11-01T14:45:38.462945577Z 2022-11-01 14:45:38 INFO  
ContainerManagerImpl:960 - Stopping container with container Id: 
container_1667306908598_0003_01_000005
2022-11-01T14:45:38.463297250Z 2022-11-01 14:45:38 INFO  ContainerImpl:1135 - 
Container container_1667306908598_0003_01_000005 transitioned from RUNNING to 
KILLING
2022-11-01T14:45:38.463616665Z 2022-11-01 14:45:38 INFO  ContainerLaunch:370 - 
Cleaning up container container_1667306908598_0003_01_000005
2022-11-01T14:45:38.464046248Z 2022-11-01 14:45:38 INFO  NMAuditLogger:89 - 
USER=hadoop    IP=172.20.0.4    OPERATION=Stop Container Request    
TARGET=ContainerManageImpl    RESULT=SUCCESS    
APPID=application_1667306908598_0003    
CONTAINERID=container_1667306908598_0003_01_000005
2022-11-01T14:45:38.476263911Z 2022-11-01 14:45:38 WARN  
DefaultContainerExecutor:224 - Exit code from container 
container_1667306908598_0003_01_000005 is : 143
2022-11-01T14:45:38.489539300Z 2022-11-01 14:45:38 INFO  ContainerImpl:1135 - 
Container container_1667306908598_0003_01_000005 transitioned from KILLING to 
CONTAINER_CLEANEDUP_AFTER_KILL
2022-11-01T14:45:38.490021365Z 2022-11-01 14:45:38 INFO  NMAuditLogger:89 - 
USER=hadoop    OPERATION=Container Finished - Killed    TARGET=ContainerImpl    
RESULT=SUCCESS    APPID=application_1667306908598_0003    
CONTAINERID=container_1667306908598_0003_01_000005
2022-11-01T14:45:38.490184714Z 2022-11-01 14:45:38 INFO  ContainerImpl:1135 - 
Container container_1667306908598_0003_01_000005 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2022-11-01T14:45:38.490830742Z 2022-11-01 14:45:38 INFO  ApplicationImpl:347 - 
Removing container_1667306908598_0003_01_000005 from application 
application_1667306908598_0003
2022-11-01T14:45:38.491627397Z 2022-11-01 14:45:38 INFO  
ContainersMonitorImpl:196 - Neither virutal-memory nor physical-memory 
monitoring is needed. Not running the monitor-thread
2022-11-01T14:45:38.492801085Z 2022-11-01 14:45:38 INFO  AuxServices:196 - Got 
event CONTAINER_STOP for appId application_1667306908598_0003
----IDLE for ~10 min-----
2022-11-01T14:58:23.116387795Z 2022-11-01 14:58:23 INFO  Server:1316 - Auth 
successful for appattempt_1667306908598_0003_000001 (auth:SIMPLE)
2022-11-01T14:58:23.125230150Z 2022-11-01 14:58:23 INFO  
ContainerManagerImpl:960 - Stopping container with container Id: 
container_1667306908598_0003_01_000001
2022-11-01T14:58:23.125819375Z 2022-11-01 14:58:23 INFO  NMAuditLogger:89 - 
USER=hadoop    IP=172.20.0.7    OPERATION=Stop Container Request    
TARGET=ContainerManageImpl    RESULT=SUCCESS    
APPID=application_1667306908598_0003    
CONTAINERID=container_1667306908598_0003_01_000001
2022-11-01T14:58:23.126914642Z 2022-11-01 14:58:23 INFO  ContainerImpl:1135 - 
Container container_1667306908598_0003_01_000001 transitioned from RUNNING to 
KILLING
2022-11-01T14:58:23.127085783Z 2022-11-01 14:58:23 INFO  ContainerLaunch:370 - 
Cleaning up container container_1667306908598_0003_01_000001
2022-11-01T14:58:23.138359689Z 2022-11-01 14:58:23 WARN  
DefaultContainerExecutor:224 - Exit code from container 
container_1667306908598_0003_01_000001 is : 143
2022-11-01T14:58:23.154846877Z 2022-11-01 14:58:23 INFO  ContainerImpl:1135 - 
Container container_1667306908598_0003_01_000001 transitioned from KILLING to 
CONTAINER_CLEANEDUP_AFTER_KILL
2022-11-01T14:58:23.156116528Z 2022-11-01 14:58:23 INFO  NMAuditLogger:89 - 
USER=hadoop    OPERATION=Container Finished - Killed    TARGET=ContainerImpl    
RESULT=SUCCESS    APPID=application_1667306908598_0003    
CONTAINERID=container_1667306908598_0003_01_000001
2022-11-01T14:58:23.156486799Z 2022-11-01 14:58:23 INFO  ContainerImpl:1135 - 
Container container_1667306908598_0003_01_000001 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2022-11-01T14:58:23.156859401Z 2022-11-01 14:58:23 INFO  ApplicationImpl:347 - 
Removing container_1667306908598_0003_01_000001 from application 
application_1667306908598_0003
2022-11-01T14:58:23.156881610Z 2022-11-01 14:58:23 INFO  
ContainersMonitorImpl:196 - Neither virutal-memory nor physical-memory 
monitoring is needed. Not running the monitor-thread
2022-11-01T14:58:23.157438464Z 2022-11-01 14:58:23 INFO  AuxServices:196 - Got 
event CONTAINER_STOP for appId application_1667306908598_0003
2022-11-01T14:58:23.159658999Z 2022-11-01 14:58:23 INFO  ApplicationImpl:464 - 
Application application_1667306908598_0003 transitioned from RUNNING to 
APPLICATION_RESOURCES_CLEANINGUP
2022-11-01T14:58:23.161177845Z 2022-11-01 14:58:23 INFO  AuxServices:196 - Got 
event APPLICATION_STOP for appId application_1667306908598_0003
2022-11-01T14:58:23.161567999Z 2022-11-01 14:58:23 INFO  ApplicationImpl:464 - 
Application application_1667306908598_0003 transitioned from 
APPLICATION_RESOURCES_CLEANINGUP to FINISHED{code}
 
RM log
{code:java}
 2022-11-01 14:45:39 INFO  RMAppAttemptImpl:1206 - Updating application attempt 
appattempt_1667306908598_0003_000001 with final state: FINISHING, and exit 
status: -1000
rm_1        | 2022-11-01 14:45:39 INFO  RMAppAttemptImpl:813 - 
appattempt_1667306908598_0003_000001 State change from RUNNING to FINAL_SAVING
rm_1        | 2022-11-01 14:45:39 INFO  RMAppImpl:993 - Updating application 
application_1667306908598_0003 with final state: FINISHING
rm_1        | 2022-11-01 14:45:39 INFO  RMAppImpl:715 - 
application_1667306908598_0003 State change from RUNNING to FINAL_SAVING
rm_1        | 2022-11-01 14:45:39 INFO  RMAppAttemptImpl:813 - 
appattempt_1667306908598_0003_000001 State change from FINAL_SAVING to FINISHING
rm_1        | 2022-11-01 14:45:39 INFO  RMStateStore:223 - Updating info for 
app: application_1667306908598_0003
rm_1        | 2022-11-01 14:45:39 INFO  RMAppImpl:715 - 
application_1667306908598_0003 State change from FINAL_SAVING to FINISHING
scm_1       | 2022-11-01 14:45:39,651 [Under Replicated Processor] INFO 
replication.UnderReplicatedProcessor: Processed 0 under replicated containers, 
failed processing 0
scm_1       | 2022-11-01 14:45:39,651 [Over Replicated Processor] INFO 
replication.OverReplicatedProcessor: Processed 0 over replicated containers, 
failed processing 0
rm_1        | 2022-11-01 14:45:40 INFO  ApplicationMasterService:354 - 
application_1667306908598_0003 unregistered successfully.

-- IDLE fo ~10 min

| 2022-11-01 14:58:23 INFO  AbstractLivelinessMonitor:127 - 
Expired:appattempt_1667306908598_0003_000001 Timed out after 600 secs
rm_1        | 2022-11-01 14:58:23 INFO  ApplicationMasterService:684 - 
Unregistering app attempt : appattempt_1667306908598_0003_000001
rm_1        | 2022-11-01 14:58:23 INFO  AMRMTokenSecretManager:124 - 
Application finished, removing password for appattempt_1667306908598_0003_000001
rm_1        | 2022-11-01 14:58:23 INFO  RMAppAttemptImpl:813 - 
appattempt_1667306908598_0003_000001 State change from FINISHING to FINISHED
rm_1        | 2022-11-01 14:58:23 INFO  RMAppImpl:715 - 
application_1667306908598_0003 State change from FINISHING to FINISHED
rm_1        | 2022-11-01 14:58:23 INFO  CapacityScheduler:882 - Application 
Attempt appattempt_1667306908598_0003_000001 is done. finalState=FINISHED
rm_1        | 2022-11-01 14:58:23 INFO  RMAuditLogger:148 - USER=hadoop    
OPERATION=Application Finished - Succeeded    TARGET=RMAppManager    
RESULT=SUCCESS    APPID=application_1667306908598_0003
rm_1        | 2022-11-01 14:58:23 INFO  RMContainerImpl:410 - 
container_1667306908598_0003_01_000001 Container Transitioned from RUNNING to 
KILLED
rm_1        | 2022-11-01 14:58:23 INFO  FiCaSchedulerApp:113 - Completed 
container: container_1667306908598_0003_01_000001 in state: KILLED event:KILL
rm_1        | 2022-11-01 14:58:23 INFO  AMLauncher:263 - Cleaning master 
appattempt_1667306908598_0003_000001
rm_1        | 2022-11-01 14:58:23 INFO  RMAuditLogger:106 - USER=hadoop    
OPERATION=AM Released Container    TARGET=SchedulerApp    RESULT=SUCCESS    
APPID=application_1667306908598_0003    
CONTAINERID=container_1667306908598_0003_01_000001
rm_1        | 2022-11-01 14:58:23 INFO  SchedulerNode:216 - Released container 
container_1667306908598_0003_01_000001 of capacity <memory:2048, vCores:1> on 
host nm:36773, which currently has 0 containers, <memory:0, vCores:0> used and 
<memory:8192, vCores:8> available, release resources=true
rm_1        | 2022-11-01 14:58:23 INFO  LeafQueue:1707 - default 
used=<memory:0, vCores:0> numContainers=0 user=hadoop user-resources=<memory:0, 
vCores:0>
rm_1        | 2022-11-01 14:58:23 INFO  LeafQueue:1658 - completedContainer 
container=Container: [ContainerId: container_1667306908598_0003_01_000001, 
NodeId: nm:36773, NodeHttpAddress: nm:8042, Resource: <memory:2048, vCores:1>, 
Priority: 0, Token: Token { kind: ContainerToken, service: 172.20.0.4:36773 }, 
] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, 
vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, 
numContainers=0 cluster=<memory:8192, vCores:8>
rm_1        | 2022-11-01 14:58:23 INFO  ParentQueue:568 - completedContainer 
queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> 
cluster=<memory:8192, vCores:8>
rm_1        | 2022-11-01 14:58:23 INFO  ParentQueue:585 - Re-sorting completed 
queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, 
usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=0
rm_1        | 2022-11-01 14:58:23 INFO  CapacityScheduler:1419 - Application 
attempt appattempt_1667306908598_0003_000001 released container 
container_1667306908598_0003_01_000001 on node: host: nm:36773 #containers=0 
available=<memory:8192, vCores:8> used=<memory:0, vCores:0> with event: KILL
rm_1        | 2022-11-01 14:58:23 INFO  AppSchedulingInfo:117 - Application 
application_1667306908598_0003 requests cleared {code}
 

> MR Acceptance test failure on Streaming Branch (HDDS-4454)
> ----------------------------------------------------------
>
>                 Key: HDDS-6126
>                 URL: https://issues.apache.org/jira/browse/HDDS-6126
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Sadanand Shenoy
>            Assignee: Sadanand Shenoy
>            Priority: Major
>         Attachments: NM-jstack.txt
>
>
> The ozone MR Acceptance fails on the streaming branch if 
> ozone.fs.datastream.enable is set to true.
> The test basically runs a yarn wordcount job and  the observation here is 
> that the job doesn’t release the Application master (AM container ) as soon 
> as the job is done and the job counters are displayed which is why the second 
> job waits for a long time ~10 min for the AM container to be released. 
> Tried with enabling streaming and disabling streaming and verified that the 
> problem occurs only when  ozone.fs.datastream.enable=true 
> {code:java}
> yarn jar 
> /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi -D  
>  fs.defaultFS=o3fs://bucket1.volume1.om/  -D ozone.fs.datastream.enable=true 
> 3 3{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDDS-6126) MR Acceptance test failure on Streaming Branch (HDDS-4454)

Reply via email to