Chris, I am pretty sure you are using RHEL6. Please see https://issues.apache.org/jira/browse/HADOOP-7154 to reduce vmem usage under RHEL6.
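The HADOOP-7154 workaround referenced above caps the number of glibc malloc arenas, which on RHEL6's newer glibc are created per-thread and inflate a JVM's virtual size. A minimal sketch of the mechanism, using plain ProcessBuilder rather than Hadoop's own container launcher (the variable name comes from the JIRA; the value 4 and the rest of this snippet are illustrative assumptions):

```java
import java.util.Map;

public class MallocArenaEnv {
    public static void main(String[] args) {
        // Sketch of the HADOOP-7154 idea: cap glibc malloc arenas so the
        // child JVM's virtual size stays bounded on RHEL6. A real launcher
        // would export this into the container's environment.
        ProcessBuilder pb = new ProcessBuilder("java", "-Xmx512M", "-version");
        Map<String, String> env = pb.environment();
        env.put("MALLOC_ARENA_MAX", "4"); // value suggested in the JIRA discussion
        System.out.println("MALLOC_ARENA_MAX=" + env.get("MALLOC_ARENA_MAX"));
    }
}
```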
- Milind

On 9/21/11 2:54 PM, "Chris Riccomini (JIRA)" <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/MAPREDUCE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112153#comment-13112153 ]
>
> Chris Riccomini commented on MAPREDUCE-3065:
> --------------------------------------------
>
> Hmm, I dropped my NM's -Xmx setting to 512m from 1024m. The VMEM usage
> STILL sits at just under the 2 gig kill limit.
>
> This seems pretty interesting:
> http://stackoverflow.com/questions/561245/virtual-memory-usage-from-java-under-linux-too-much-memory-used
>
> Reading.
>
>> ApplicationMaster killed by NodeManager due to excessive virtual memory consumption
>> -----------------------------------------------------------------------------------
>>
>>                 Key: MAPREDUCE-3065
>>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3065
>>             Project: Hadoop Map/Reduce
>>          Issue Type: Bug
>>          Components: nodemanager
>>    Affects Versions: 0.24.0
>>            Reporter: Chris Riccomini
>>
>> > Hey Vinod,
>> >
>> > OK, so I have a little more clarity into this.
>> >
>> > When I bump my resource request for my AM to 4096, it runs. The important line in the NM logs is:
>> >
>> > 2011-09-21 13:43:44,366 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 25656 for container-id container_1316637655278_0001_01_000001 : Virtual 2260938752 bytes, limit : 4294967296 bytes; Physical 120860672 bytes, limit -1 bytes
>> >
>> > The thing to note is the virtual memory, which is off the charts, even though my physical memory is almost nothing (roughly 115 megs). I'm still poking around the code, but I am noticing that there are two checks in the NM, one for virtual mem, and one for physical mem. The virtual memory check appears to be toggle-able, but is presumably defaulted to on.
>> >
>> > At this point I'm trying to figure out exactly what the VMEM check is for, why YARN thinks my app is taking 2 gigs, and how to fix this.
>> >
>> > Cheers,
>> > Chris
>> > ________________________________________
>> > From: Chris Riccomini [[email protected]]
>> > Sent: Wednesday, September 21, 2011 1:42 PM
>> > To: [email protected]
>> > Subject: Re: ApplicationMaster Memory Usage
>> >
>> > For the record, I bumped the memory resource request to 4096, and it works. :(
>> >
>> >
>> > On 9/21/11 1:32 PM, "Chris Riccomini" <[email protected]> wrote:
>> >
>> >> Hey Vinod,
>> >>
>> >> So, I ran my application master directly from the CLI. I commented out the YARN-specific code. It runs fine without leaking memory.
>> >>
>> >> I then ran it from YARN, with all YARN-specific code commented out. It again ran fine.
>> >>
>> >> I then uncommented JUST my registerWithResourceManager call. It then fails with OOM after a few seconds. I call registerWithResourceManager, and then go into a while(true) { println("yeh"); sleep(1000) }. Doing this prints:
>> >>
>> >> yeh
>> >> yeh
>> >> yeh
>> >> yeh
>> >> yeh
>> >>
>> >> At which point it dies, and, in the NodeManager, I see:
>> >>
>> >> 2011-09-21 13:24:51,036 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0005_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2192773120
>> >> 2011-09-21 13:24:51,037 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=23852,containerID=container_1316626117280_0005_01_000001] is running beyond memory-limits. Current usage : 2192773120 bytes. Limit : 2147483648 bytes. Killing container.
>> >> Dump of the process-tree for container_1316626117280_0005_01_000001 :
>> >> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>> >> |- 23852 20570 23852 23852 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stdout 2>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stderr
>> >> |- 23855 23852 23852 23852 (java) 81 4 2084134912 14772 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1
>> >> 2011-09-21 13:24:51,037 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 23852
>> >>
>> >> Either something is leaking in YARN, or my registerWithResourceManager code (see below) is doing something funky.
>> >>
>> >> I'm trying to avoid going through all the pain of attaching a remote debugger. Presumably things aren't leaking in YARN, which means it's likely that I'm doing something wrong in my registration code.
>> >>
>> >> Incidentally, my NodeManager is running with 1000 megs. My application master memory is set to 2048, and my -Xmx setting is 512M.
>> >>
>> >> Cheers,
>> >> Chris
>> >> ________________________________________
>> >> From: Vinod Kumar Vavilapalli [[email protected]]
>> >> Sent: Wednesday, September 21, 2011 11:52 AM
>> >> To: [email protected]
>> >> Subject: Re: ApplicationMaster Memory Usage
>> >>
>> >> Actually MAPREDUCE-2998 is only related to MRV2, so that isn't related.
>> >>
>> >> Somehow, your JVM itself is taking up a lot of virtual memory.
>> >> Are you loading some native libs?
>> >>
>> >> And how many containers have already been allocated by the time the AM crashes? Maybe you are accumulating some per-container data. You can try dumping the heap via hprof.
>> >>
>> >> +Vinod
>> >>
>> >> On Wed, Sep 21, 2011 at 11:21 PM, Chris Riccomini <[email protected]> wrote:
>> >>
>> >>> Hey Vinod,
>> >>>
>> >>> I svn up'd, and rebuilt. My application's task (container) now runs!
>> >>>
>> >>> Unfortunately, my application master eventually gets killed by the NodeManager anyway, and I'm still not clear as to why. The AM is just running a loop, asking for a container, and executing a command in the container. It keeps doing this over and over again. After a few iterations, it gets killed with something like:
>> >>>
>> >>> 2011-09-21 10:42:40,869 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 21666 for container-id container_1316626117280_0002_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 77398016 bytes, limit -1 bytes
>> >>> 2011-09-21 10:42:40,869 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0002_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
>> >>> 2011-09-21 10:42:40,870 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=21666,containerID=container_1316626117280_0002_01_000001] is running beyond memory-limits. Current usage : 2260938752 bytes. Limit : 2147483648 bytes. Killing container.
>> >>> Dump of the process-tree for container_1316626117280_0002_01_000001 :
>> >>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>> >>> |- 21669 21666 21666 21666 (java) 105 4 2152300544 18593 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1
>> >>> |- 21666 20570 21666 21666 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stdout 2>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stderr
>> >>>
>> >>> 2011-09-21 10:42:40,870 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 21666
>> >>>
>> >>> I don't think that my AM is leaking memory. Full code paste after the break.
>> >>>
>> >>> 1. Do I need to release a container in my AM even if the AM receives it as a finished container in the resource request response?
>> >>> 2. Do I need to free any other resources after a resource request (e.g. ResourceRequest, AllocateRequest, etc.)?
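The kill decision in the dump above reduces to summing the VMEM_USAGE(BYTES) column over every process in the container's tree (the java child plus the bash wrapper) and comparing against the byte limit. A rough sketch of that arithmetic, using the figures from this dump; this is not the actual ContainersMonitorImpl code, just an illustration of the check:

```java
public class VmemCheck {
    // Sum virtual-memory usage across a container's process tree and
    // report whether it exceeds the configured limit, as the NM's
    // monitor does before killing a container.
    static boolean overLimit(long[] vmemBytesPerProcess, long limitBytes) {
        long total = 0;
        for (long v : vmemBytesPerProcess) total += v;
        return total > limitBytes;
    }

    public static void main(String[] args) {
        // Figures from the dump: java child (21669) + bash wrapper (21666).
        long[] tree = {2152300544L, 108638208L};
        long limit = 2147483648L; // 2048 MB in bytes
        long total = tree[0] + tree[1];
        System.out.println(total + " > " + limit + " ? " + overLimit(tree, limit));
        // prints: 2260938752 > 2147483648 ? true
    }
}
```

The 2260938752-byte total is exactly the "current usage" the WARN lines report, which confirms the monitor is charging the bash wrapper's virtual size to the container as well.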
>> >>>
>> >>> Cheers,
>> >>> Chris
>> >>>
>> >>>
>> >>> def main(args: Array[String]) {
>> >>>   // YARN will always give our ApplicationMaster
>> >>>   // the first four parameters as: <package> <app id> <attempt id> <timestamp>
>> >>>   val packagePath = args(0)
>> >>>   val appId = args(1).toInt
>> >>>   val attemptId = args(2).toInt
>> >>>   val timestamp = args(3).toLong
>> >>>
>> >>>   // these are our application master's parameters
>> >>>   val streamerClass = args(4)
>> >>>   val tasks = args(5).toInt
>> >>>
>> >>>   // TODO log params here
>> >>>
>> >>>   // start the application master helper
>> >>>   val conf = new Configuration
>> >>>   val applicationMasterHelper = new ApplicationMasterHelper(appId, attemptId, timestamp, conf)
>> >>>     .registerWithResourceManager
>> >>>
>> >>>   // start and manage the slaves
>> >>>   val noReleases = List[ContainerId]()
>> >>>   var runningContainers = 0
>> >>>
>> >>>   // keep going forever
>> >>>   while (true) {
>> >>>     val nonRunningTasks = tasks - runningContainers
>> >>>     val response = applicationMasterHelper.sendResourceRequest(nonRunningTasks, noReleases)
>> >>>
>> >>>     response.getAllocatedContainers.foreach(container => {
>> >>>       new ContainerExecutor(packagePath, container)
>> >>>         .addCommand("java -Xmx256M -cp './package/*' kafka.yarn.StreamingTask " + streamerClass + " "
>> >>>           + "1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout "
>> >>>           + "2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr").execute(conf)
>> >>>     })
>> >>>
>> >>>     runningContainers += response.getAllocatedContainers.length
>> >>>     runningContainers -= response.getCompletedContainersStatuses.length
>> >>>
>> >>>     Thread.sleep(1000)
>> >>>   }
>> >>>
>> >>>   applicationMasterHelper.unregisterWithResourceManager("SUCCESS")
>> >>> }
>> >>>
>> >>>
>> >>> class ApplicationMasterHelper(iAppId: Int, iAppAttemptId: Int, lTimestamp: Long, conf: Configuration) {
>> >>>   val rpc = YarnRPC.create(conf)
>> >>>   val appId =
>> >>>     Records.newRecord(classOf[ApplicationId])
>> >>>   val appAttemptId = Records.newRecord(classOf[ApplicationAttemptId])
>> >>>   val rmAddress = NetUtils.createSocketAddr(conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS))
>> >>>   val resourceManager = rpc.getProxy(classOf[AMRMProtocol], rmAddress, conf).asInstanceOf[AMRMProtocol]
>> >>>   var requestId = 0
>> >>>
>> >>>   appId.setClusterTimestamp(lTimestamp)
>> >>>   appId.setId(iAppId)
>> >>>   appAttemptId.setApplicationId(appId)
>> >>>   appAttemptId.setAttemptId(iAppAttemptId)
>> >>>
>> >>>   def registerWithResourceManager(): ApplicationMasterHelper = {
>> >>>     val req = Records.newRecord(classOf[RegisterApplicationMasterRequest])
>> >>>     req.setApplicationAttemptId(appAttemptId)
>> >>>     // TODO not sure why these are blank - this is how Spark does it
>> >>>     req.setHost("")
>> >>>     req.setRpcPort(1)
>> >>>     req.setTrackingUrl("")
>> >>>     resourceManager.registerApplicationMaster(req)
>> >>>     this
>> >>>   }
>> >>>
>> >>>   def unregisterWithResourceManager(state: String): ApplicationMasterHelper = {
>> >>>     val finReq = Records.newRecord(classOf[FinishApplicationMasterRequest])
>> >>>     finReq.setAppAttemptId(appAttemptId)
>> >>>     finReq.setFinalState(state)
>> >>>     resourceManager.finishApplicationMaster(finReq)
>> >>>     this
>> >>>   }
>> >>>
>> >>>   def sendResourceRequest(containers: Int, release: List[ContainerId]): AMResponse = {
>> >>>     // TODO will need to make this more flexible for hostname requests, etc.
>> >>>     val request = Records.newRecord(classOf[ResourceRequest])
>> >>>     val pri = Records.newRecord(classOf[Priority])
>> >>>     val capability = Records.newRecord(classOf[Resource])
>> >>>     val req = Records.newRecord(classOf[AllocateRequest])
>> >>>     request.setHostName("*")
>> >>>     request.setNumContainers(containers)
>> >>>     pri.setPriority(1)
>> >>>     request.setPriority(pri)
>> >>>     capability.setMemory(128)
>> >>>     request.setCapability(capability)
>> >>>     req.setResponseId(requestId)
>> >>>     req.setApplicationAttemptId(appAttemptId)
>> >>>     req.addAllAsks(Lists.newArrayList(request))
>> >>>     req.addAllReleases(release)
>> >>>     requestId += 1
>> >>>     // TODO we might want to return a list of container executors here instead of AMResponses
>> >>>     resourceManager.allocate(req).getAMResponse
>> >>>   }
>> >>> }
>> >>>
>> >>>
>> >>> ________________________________________
>> >>> From: Vinod Kumar Vavilapalli [[email protected]]
>> >>> Sent: Wednesday, September 21, 2011 10:08 AM
>> >>> To: [email protected]
>> >>> Subject: Re: ApplicationMaster Memory Usage
>> >>>
>> >>> Yes, the process dump clearly tells that this is MAPREDUCE-2998.
>> >>>
>> >>> +Vinod
>> >>> (With a smirk to see his container-memory-monitoring code in action)
>> >>>
>> >>> On Wed, Sep 21, 2011 at 10:26 PM, Arun C Murthy <[email protected]> wrote:
>> >>>
>> >>>> I'll bet you are hitting MR-2998.
>> >>>>
>> >>>> From the changelog:
>> >>>>
>> >>>>   MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.
>> >>>>
>> >>>> Arun
>> >>>>
>> >>>> On Sep 21, 2011, at 9:52 AM, Chris Riccomini wrote:
>> >>>>
>> >>>>> Hey Guys,
>> >>>>>
>> >>>>> My ApplicationMaster is being killed by the NodeManager because of memory consumption, and I don't understand why. I'm using -Xmx512M, and setting my resource request to 2048.
>> >>>>>
>> >>>>> .addCommand("java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster " ...
>> >>>>>
>> >>>>> ...
>> >>>>>
>> >>>>> private var memory = 2048
>> >>>>>
>> >>>>> resource.setMemory(memory)
>> >>>>> containerCtx.setResource(resource)
>> >>>>> containerCtx.setCommands(cmds.toList)
>> >>>>> containerCtx.setLocalResources(Collections.singletonMap("package", packageResource))
>> >>>>> appCtx.setApplicationId(appId)
>> >>>>> appCtx.setUser(user.getShortUserName)
>> >>>>> appCtx.setAMContainerSpec(containerCtx)
>> >>>>> request.setApplicationSubmissionContext(appCtx)
>> >>>>> applicationsManager.submitApplication(request)
>> >>>>>
>> >>>>> When this runs, I see (in my NodeManager's logs):
>> >>>>>
>> >>>>> 2011-09-21 09:35:19,112 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 28134 for container-id container_1316559026783_0003_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 71540736 bytes, limit -1 bytes
>> >>>>> 2011-09-21 09:35:19,112 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316559026783_0003_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
>> >>>>> 2011-09-21 09:35:19,113 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=28134,containerID=container_1316559026783_0003_01_000001] is running beyond memory-limits. Current usage : 2260938752 bytes. Limit : 2147483648 bytes. Killing container.
>> >>>>> Dump of the process-tree for container_1316559026783_0003_01_000001 :
>> >>>>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>> >>>>> |- 28134 25886 28134 28134 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1 1>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stdout 2>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stderr
>> >>>>> |- 28137 28134 28134 28134 (java) 92 3 2152300544 17163 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1
>> >>>>>
>> >>>>> 2011-09-21 09:35:19,113 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 28134
>> >>>>>
>> >>>>> It appears that YARN is honoring my 2048 request, yet my process is somehow taking 2260938752 bytes. I don't think that I'm using nearly that much in permgen, and my heap is limited to 512. I don't have any JNI stuff running (that I know of), so it's unclear to me what's going on here. The only thing that I can think of is that Java's Runtime exec is forking, and copying its entire JVM memory footprint for the fork.
>> >>>>>
>> >>>>> Has anyone seen this? Am I doing something dumb?
>> >>>>>
>> >>>>> Thanks!
>> >>>>> Chris
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
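One arithmetic note on the limits quoted throughout this thread: Resource.setMemory is denominated in MB while ContainersMonitorImpl logs bytes, and the two observed limits line up exactly with the 2048 request and the later 4096 bump. A quick check:

```java
public class LimitMath {
    public static void main(String[] args) {
        // setMemory takes MB; the NM monitor logs the limit in bytes.
        long mb = 1024L * 1024L;
        System.out.println(2048 * mb); // limit seen with setMemory(2048)
        System.out.println(4096 * mb); // limit seen after bumping to 4096
        // The tree's virtual size (2260938752 bytes, ~2.1 GB) clears the
        // first limit but fits under the second, matching the thread.
    }
}
```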
