[
https://issues.apache.org/jira/browse/HADOOP-14858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606100#comment-16606100
]
Hans Brende commented on HADOOP-14858:
--------------------------------------
+1: I am experiencing the same issue.
> Why does YARN crash?
> ------------------
>
> Key: HADOOP-14858
> URL: https://issues.apache.org/jira/browse/HADOOP-14858
> Project: Hadoop Common
> Issue Type: Bug
> Environment: Production
> Reporter: anikad ayman
> Priority: Major
> Fix For: 2.7.0
>
>
> During MapReduce processing, YARN crashed and job processing stopped. I
> managed to resume processing by killing the first job that was running, but
> after a few minutes there was another crash, which I resolved by killing the
> second job that was running.
>
> We are looking for the causes of this crash, which we have had several times
> before (once or twice a month).
>
> In the ResourceManager logs, I found the following messages repeated from the
> beginning of the crash until the first job was killed, and again for a few
> minutes before the second job was killed:
>
>
> {code:java}
> 2017-08-25 03:51:58,815 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38352 Call#33361 Retry#0
> 2017-08-25 03:53:39,255 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38456 Call#33364 Retry#0
> 2017-08-25 03:55:19,700 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38556 Call#33367 Retry#0
> 2017-08-25 03:57:00,262 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38674 Call#33370 Retry#0
> 2017-08-25 03:58:40,687 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38804 Call#33373 Retry#0
> {code}
> .
> .
> .
>
> {code:java}
> 2017-08-25 11:02:44,086 WARN org.apache.hadoop.ipc.Server: Large response
> size 4751251 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:39778 Call#34159 Retry#0
> 2017-08-25 11:02:47,933 WARN org.apache.hadoop.ipc.Server: Large response
> size 4751251 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:39778 Call#34162 Retry#0
> 2017-08-25 11:03:06,800 WARN org.apache.hadoop.ipc.Server: Large response
> size 4751251 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:39814 Call#34165 Retry#0
>
> {code}
> NB: We still get this warning from time to time, and we are still wondering
> whether it relates to the connection between the NodeManager (10.135.8.101)
> and the ResourceManager, or to something else.
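>
> The "Large response size" warning above is logged by the Hadoop IPC server
> whenever a single RPC response exceeds {{ipc.server.max.response.size}}
> (1 MB by default). A ~4.7 MB getApplications response suggests the
> ResourceManager is returning metadata for thousands of retained applications
> in one call. As a sketch (the value below is illustrative for your setup,
> not a confirmed fix), reducing the number of completed applications the RM
> keeps in memory shrinks this response:
>
> {code:xml}
> <!-- yarn-site.xml: illustrative value only; the default is 10000 -->
> <property>
>   <name>yarn.resourcemanager.max-completed-applications</name>
>   <value>1000</value>
> </property>
> {code}
> Alternatively, whatever client on 10.135.8.101 is calling getApplications
> every ~100 seconds could filter its request (for example,
> {{yarn application -list -appStates RUNNING}}) instead of fetching all
> applications.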
>
> The same thing appears in the NodeManager logs:
>
> {code:java}
> 2017-08-25 03:51:54,396 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 98201 for container-id
> container_e41_1500982512144_36679_01_000382: 1.4 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 03:51:54,791 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 112912 for container-id
> container_e41_1500982512144_36679_01_000387: 2.3 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 03:51:55,177 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 105848 for container-id
> container_e41_1500982512144_36627_01_001644: 619.4 MB of 10 GB physical
> memory used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 03:51:58,938 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 98201 for container-id
> container_e41_1500982512144_36679_01_000382: 1.4 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> {code}
> .
> .
> .
>
> {code:java}
> 2017-08-25 11:05:40,104 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 112912 for container-id
> container_e41_1500982512144_36679_01_000387: 1.1 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 11:05:40,493 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 105848 for container-id
> container_e41_1500982512144_36627_01_001644: 648.4 MB of 10 GB physical
> memory used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 11:05:43,867 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 98201 for container-id
> container_e41_1500982512144_36679_01_000382: 1.1 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 11:05:45,040 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 105848 for container-id
> container_e41_1500982512144_36627_01_001644: 648.4 MB of 10 GB physical
> memory used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 11:05:48,397 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
> Container container_e41_1500982512144_36627_01_001644 transitioned from
> RUNNING to KILLING
> 2017-08-25 11:05:48,397 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
> Application application_1500982512144_36627 transitioned from RUNNING to
> FINISHING_CONTAINERS_WAIT
> 2017-08-25 11:05:48,397 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Cleaning up container container_e41_1500982512144_36627_01_001644
> {code}
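>
> The ContainersMonitorImpl lines above are routine per-container memory
> polling, not errors: each container here is limited to 10 GB physical
> memory, and the 21 GB virtual limit is consistent with the default
> {{yarn.nodemanager.vmem-pmem-ratio}} of 2.1 (10 GB x 2.1). A minimal sketch
> of the relevant NodeManager settings (the values shown are the defaults,
> not a recommendation):
>
> {code:xml}
> <!-- yarn-site.xml: memory-enforcement defaults, for reference -->
> <property>
>   <name>yarn.nodemanager.pmem-check-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.vmem-check-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.vmem-pmem-ratio</name>
>   <value>2.1</value>
> </property>
> {code}
> A container is killed by the monitor only when it exceeds these limits; the
> usage shown in the logs (well under 10 GB physical and 21 GB virtual) does
> not indicate a limit violation.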
> and likewise in the job history server logs:
>
> {code:java}
> 2017-08-25 03:53:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
> Starting scan to move intermediate done files
> 2017-08-25 03:56:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 03:59:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 04:02:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 04:05:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 04:08:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 04:11:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> {code}
> .
> .
> .
>
> {code:java}
> 2017-08-25 11:05:36,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
> History Cleaner started
> 2017-08-25 11:05:41,271 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: History Cleaner complete
> 2017-08-25 11:06:04,214 INFO
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> Updating the current master key for generating delegation tokens
> 2017-08-25 11:08:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 11:08:06,518 INFO
> org.apache.hadoop.mapreduce.jobhistory.JobSummary:
> jobId=job_1500982512144_36793,submitTime=1503647426340,launchTime=1503651960434,firstMapTaskLaunchTime=1503651982671,firstReduceTaskLaunchTime=0,finishTime=1503651985794,resourcesPerMap=5120,resourcesPerReduce=0,numMaps=1,numReduces=0,user=mapr,queue=default,status=SUCCEEDED,mapSlotSeconds=9,reduceSlotSeconds=0,jobName=SELECT
> `C_7361705f62736973`.`buk...20170825)(Stage-1)
> 2017-08-25 11:08:06,518 INFO
> org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Deleting JobSummary
> file:
> [maprfs:/var/mapr/cluster/yarn/rm/staging/history/done_intermediate/mapr/job_1500982512144_36793.summary]
> 2017-08-25 11:08:06,518 INFO
> org.apache.hadoop.mapreduce.jobhistory.JobSummary:
> jobId=job_1500982512144_36778,submitTime=1503642110785,launchTime=1503651960266,firstMapTaskLaunchTime=1503651969483,firstReduceTaskLaunchTime=0,finishTime=1503651976016,resourcesPerMap=5120,resourcesPerReduce=0,numMaps=1,numReduces=0,user=mapr,queue=default,status=SUCCEEDED,mapSlotSeconds=19,reduceSlotSeconds=0,jobName=SELECT
> `C_7361705f7662726b`.`vbe...20170825)(Stage-1)
> {code}
>
> Do you have any explanation of, or solution for, this issue?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)