[
https://issues.apache.org/jira/browse/HADOOP-14858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606100#comment-16606100
]
Hans Brende commented on HADOOP-14858:
--------------------------------------
+1: I am experiencing the same issue.
> Why does YARN crash?
> ------------------
>
> Key: HADOOP-14858
> URL: https://issues.apache.org/jira/browse/HADOOP-14858
> Project: Hadoop Common
> Issue Type: Bug
> Environment: Production
> Reporter: anikad ayman
> Priority: Major
> Fix For: 2.7.0
>
>
> During MapReduce processing, YARN crashed and job processing stopped. I
> managed to resume processing by killing the first job that was running, but
> after a few minutes there was another crash, which I resolved by killing the
> second job that was running.
>
> We are looking for the causes of this crash, which we have had several times
> before (once or twice a month).
>
> In the ResourceManager logs, I found the following messages repeated from the
> beginning of the crash until the first job was killed, and again for a few
> minutes before the second job was killed:
>
>
> {code:java}
> 2017-08-25 03:51:58,815 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38352 Call#33361 Retry#0
> 2017-08-25 03:53:39,255 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38456 Call#33364 Retry#0
> 2017-08-25 03:55:19,700 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38556 Call#33367 Retry#0
> 2017-08-25 03:57:00,262 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38674 Call#33370 Retry#0
> 2017-08-25 03:58:40,687 WARN org.apache.hadoop.ipc.Server: Large response
> size 4739374 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:38804 Call#33373 Retry#0
> {code}
> .
> .
> .
>
> {code:java}
> 2017-08-25 11:02:44,086 WARN org.apache.hadoop.ipc.Server: Large response
> size 4751251 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:39778 Call#34159 Retry#0
> 2017-08-25 11:02:47,933 WARN org.apache.hadoop.ipc.Server: Large response
> size 4751251 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:39778 Call#34162 Retry#0
> 2017-08-25 11:03:06,800 WARN org.apache.hadoop.ipc.Server: Large response
> size 4751251 for call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 10.135.8.101:39814 Call#34165 Retry#0
>
> {code}
> NB: We still get this warning from time to time, and we are still wondering
> whether it relates to the connection between the NodeManager (10.135.8.101)
> and the ResourceManager, or to something else.
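>
> The "Large response size" warning above is logged by the Hadoop IPC server
> whenever a single RPC response exceeds {{ipc.server.max.response.size}}
> (1 MB by default). A ~4.7 MB getApplications response suggests the
> ResourceManager is returning metadata for thousands of retained applications
> in one call. As a sketch (the value below is illustrative for your setup,
> not a confirmed fix), reducing the number of completed applications the RM
> keeps in memory shrinks this response:
>
> {code:xml}
> <!-- yarn-site.xml: illustrative value only; the default is 10000 -->
> <property>
>   <name>yarn.resourcemanager.max-completed-applications</name>
>   <value>1000</value>
> </property>
> {code}
> Alternatively, whatever client on 10.135.8.101 is calling getApplications
> every ~100 seconds could filter its request (for example,
> {{yarn application -list -appStates RUNNING}}) instead of fetching all
> applications.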
>
> The same thing appears in the NodeManager logs:
>
> {code:java}
> 2017-08-25 03:51:54,396 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 98201 for container-id
> container_e41_1500982512144_36679_01_000382: 1.4 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 03:51:54,791 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 112912 for container-id
> container_e41_1500982512144_36679_01_000387: 2.3 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 03:51:55,177 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 105848 for container-id
> container_e41_1500982512144_36627_01_001644: 619.4 MB of 10 GB physical
> memory used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 03:51:58,938 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 98201 for container-id
> container_e41_1500982512144_36679_01_000382: 1.4 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> {code}
> .
> .
> .
>
> {code:java}
> 2017-08-25 11:05:40,104 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 112912 for container-id
> container_e41_1500982512144_36679_01_000387: 1.1 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 11:05:40,493 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 105848 for container-id
> container_e41_1500982512144_36627_01_001644: 648.4 MB of 10 GB physical
> memory used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 11:05:43,867 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 98201 for container-id
> container_e41_1500982512144_36679_01_000382: 1.1 GB of 10 GB physical memory
> used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 11:05:45,040 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 105848 for container-id
> container_e41_1500982512144_36627_01_001644: 648.4 MB of 10 GB physical
> memory used; 10.1 GB of 21 GB virtual memory used
> 2017-08-25 11:05:48,397 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
> Container container_e41_1500982512144_36627_01_001644 transitioned from
> RUNNING to KILLING
> 2017-08-25 11:05:48,397 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
> Application application_1500982512144_36627 transitioned from RUNNING to
> FINISHING_CONTAINERS_WAIT
> 2017-08-25 11:05:48,397 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Cleaning up container container_e41_1500982512144_36627_01_001644
> {code}
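>
> The ContainersMonitorImpl lines above are routine per-container memory
> polling, not errors: each container here is limited to 10 GB physical
> memory, and the 21 GB virtual limit is consistent with the default
> {{yarn.nodemanager.vmem-pmem-ratio}} of 2.1 (10 GB x 2.1). A minimal sketch
> of the relevant NodeManager settings (the values shown are the defaults,
> not a recommendation):
>
> {code:xml}
> <!-- yarn-site.xml: memory-enforcement defaults, for reference -->
> <property>
>   <name>yarn.nodemanager.pmem-check-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.vmem-check-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.vmem-pmem-ratio</name>
>   <value>2.1</value>
> </property>
> {code}
> A container is killed by the monitor only when it exceeds these limits; the
> usage shown in the logs (well under 10 GB physical and 21 GB virtual) does
> not indicate a limit violation.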
> and likewise in the job history server logs:
>
> {code:java}
> 2017-08-25 03:53:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
> Starting scan to move intermediate done files
> 2017-08-25 03:56:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 03:59:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 04:02:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 04:05:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 04:08:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 04:11:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> {code}
> .
> .
> .
>
> {code:java}
> 2017-08-25 11:05:36,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
> History Cleaner started
> 2017-08-25 11:05:41,271 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: History Cleaner complete
> 2017-08-25 11:06:04,214 INFO
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> Updating the current master key for generating delegation tokens
> 2017-08-25 11:08:06,504 INFO
> org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move
> intermediate done files
> 2017-08-25 11:08:06,518 INFO
> org.apache.hadoop.mapreduce.jobhistory.JobSummary:
> jobId=job_1500982512144_36793,submitTime=1503647426340,launchTime=1503651960434,firstMapTaskLaunchTime=1503651982671,firstReduceTaskLaunchTime=0,finishTime=1503651985794,resourcesPerMap=5120,resourcesPerReduce=0,numMaps=1,numReduces=0,user=mapr,queue=default,status=SUCCEEDED,mapSlotSeconds=9,reduceSlotSeconds=0,jobName=SELECT
> `C_7361705f62736973`.`buk...20170825)(Stage-1)
> 2017-08-25 11:08:06,518 INFO
> org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Deleting JobSummary
> file:
> [maprfs:/var/mapr/cluster/yarn/rm/staging/history/done_intermediate/mapr/job_1500982512144_36793.summary]
> 2017-08-25 11:08:06,518 INFO
> org.apache.hadoop.mapreduce.jobhistory.JobSummary:
> jobId=job_1500982512144_36778,submitTime=1503642110785,launchTime=1503651960266,firstMapTaskLaunchTime=1503651969483,firstReduceTaskLaunchTime=0,finishTime=1503651976016,resourcesPerMap=5120,resourcesPerReduce=0,numMaps=1,numReduces=0,user=mapr,queue=default,status=SUCCEEDED,mapSlotSeconds=19,reduceSlotSeconds=0,jobName=SELECT
> `C_7361705f7662726b`.`vbe...20170825)(Stage-1)
> {code}
>
> Do you have any explanation of, or solution for, this issue?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)