[ https://issues.apache.org/jira/browse/TEZ-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354537#comment-14354537 ]

Rajesh Balamohan commented on TEZ-2186:
---------------------------------------

I have yet to reproduce the OOM in my test runs. However, with NoOpProcessor 
(which closes the input very quickly, without doing much work), the following 
exception is thrown in Shuffle.shutdown().

This causes thread leaks, since the fetcher threads, the referee thread in 
ShuffleScheduler, and the merger threads are not shut down cleanly. I can see 
a bunch of these threads still alive in the next iteration.

{noformat}
java.util.concurrent.CancellationException
        at java.util.concurrent.FutureTask.report(FutureTask.java:121)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:132)
        at com.google.common.util.concurrent.Futures$6.run(Futures.java:974)
        at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
        at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
        at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:134)
        at com.google.common.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:86)
        at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:384)
        at java.util.concurrent.FutureTask.cancel(FutureTask.java:180)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.shutdown(Shuffle.java:323)
        at org.apache.tez.runtime.library.input.OrderedGroupedKVInput.close(OrderedGroupedKVInput.java:186)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:345)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
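
Reading the trace bottom-up: Shuffle.shutdown() cancels a pending 
ListenableFutureTask; cancel() runs the task's completion listener 
synchronously on the cancelling thread (sameThreadExecutor), and the listener 
calls getUninterruptibly() on the now-cancelled future, which throws 
CancellationException. A minimal standalone sketch of that interaction 
(hypothetical code for illustration, not the actual Tez/Guava wiring):

{code:java}
// Hypothetical repro of the failure mode above, not the Tez code itself:
// cancel a ListenableFutureTask whose completion listener runs on the
// cancelling thread and calls get() on the cancelled future.
import com.google.common.util.concurrent.ListenableFutureTask;
import com.google.common.util.concurrent.MoreExecutors;
import com.google.common.util.concurrent.Uninterruptibles;

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;

public class CancelListenerRepro {
  public static void main(String[] args) {
    final ListenableFutureTask<Void> task =
        ListenableFutureTask.create(new Callable<Void>() {
          @Override
          public Void call() throws Exception {
            Thread.sleep(60000); // stand-in for a long-running fetch/merge
            return null;
          }
        });
    // sameThreadExecutor() runs the listener on whichever thread completes
    // the task -- here, the thread that calls cancel().
    task.addListener(new Runnable() {
      @Override
      public void run() {
        try {
          // Throws CancellationException because the task was cancelled;
          // depending on the Guava version this is caught and logged by
          // ExecutionList with a stack trace like the one above.
          Uninterruptibles.getUninterruptibly(task);
        } catch (ExecutionException e) {
          // would mean the task failed with an exception instead
        }
      }
    }, MoreExecutors.sameThreadExecutor());
    new Thread(task).start();
    task.cancel(true); // cancel() -> done() -> listener -> CancellationException
  }
}
{code}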


To check for the leak, I tried printing all the threads in the JVM from the 
constructor of common.shuffle.orderedgrouped.Shuffle. This runs well before 
the fetchers, merger threads, etc. are started, so none of those threads 
should exist yet at that point. Here is sample output from the debug patch I 
had (a minimal sketch of this kind of dump appears after the listing):
{noformat}
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=159, name=TezTaskEventRouter[attempt_1424502260528_0715_1_01_001585_0], state=WAITING
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=157, name=fetcher [map] #12, state=WAITING
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=156, name=fetcher [map] #11, state=WAITING
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=155, name=fetcher [map] #10, state=WAITING
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=142, name=fetcher [map] #9, state=WAITING
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=141, name=fetcher [map] #8, state=WAITING
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=140, name=fetcher [map] #7, state=WAITING
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=121, name=fetcher [map] #6, state=WAITING
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=120, name=fetcher [map] #5, state=WAITING
2015-03-10 01:02:52,595 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=119, name=fetcher [map] #4, state=WAITING
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=112, name=fetcher [map] #3, state=WAITING
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=111, name=fetcher [map] #2, state=WAITING
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=110, name=fetcher [map] #1, state=WAITING
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=24, name=TaskHeartbeatThread, state=WAITING
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=23, name=IPC Parameter Sending Thread #0, state=TIMED_WAITING
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=22, name=IPC Client (1279309678) connection to /172.19.128.52:34897 from application_1424502260528_0715, state=RUNNABLE
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=21, name=TezChild, state=RUNNABLE
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=4, name=Signal Dispatcher, state=RUNNABLE
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=3, name=Finalizer, state=WAITING
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=2, name=Reference Handler, state=WAITING
2015-03-10 01:02:52,596 INFO [TezChild] orderedgrouped.Shuffle: Patch..threadId=1, name=main, state=WAITING
{noformat}

As the listing shows, the previous attempt did not clean up its fetcher and 
merger threads.
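
For reference, a listing like the one above takes only a few lines of 
standard Java. This is just a sketch along the lines of the debug patch, not 
the patch itself:

{code:java}
// Sketch of the thread-listing debug code (illustrative, not the actual
// patch). Thread.getAllStackTraces() snapshots all live threads in the JVM;
// each is printed with its id, name, and state, matching the log format above.
public class ThreadDump {
  public static void dumpAllThreads() {
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      System.out.println("Patch..threadId=" + t.getId()
          + ", name=" + t.getName()
          + ", state=" + t.getState());
    }
  }

  public static void main(String[] args) {
    dumpAllThreads();
  }
}
{code}

Leaked fetcher/merger threads from a previous attempt stand out immediately 
in such a dump, since their names (fetcher [map] #N, etc.) identify them.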

> OOM with a simple scatter gather job with re-use
> ------------------------------------------------
>
>                 Key: TEZ-2186
>                 URL: https://issues.apache.org/jira/browse/TEZ-2186
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Siddharth Seth
>         Attachments: noopexample.txt
>
>
> With a no-op scatter gather job, 20K x 2K, on a 20 node cluster with 20 2GB 
> containers per node - reducers end up failing with OOM errors. Haven't been 
> able to generate a heap dump yet. Will add details as they're found. 


