[ https://issues.apache.org/jira/browse/TEZ-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2186:
----------------------------------
    Attachment: TEZ-2186.1.patch

Reproduced the issue with a 50K x 50K job using 2 GB containers.

{noformat}
2015-03-10 03:20:35,369 ERROR [TezChild] task.TezTaskRunner: Exception of type Error.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.configureAndStart(MergeManager.java:288)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.run(Shuffle.java:313)
        at org.apache.tez.runtime.library.input.OrderedGroupedKVInput.start(OrderedGroupedKVInput.java:125)
        at org.apache.tez.runtime.library.processor.SimpleProcessor.preOp(SimpleProcessor.java:73)
        at org.apache.tez.runtime.library.processor.SimpleProcessor.run(SimpleProcessor.java:52)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:177)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:177)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:173)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2015-03-10 03:20:35,431 WARN [TaskHeartbeatThread] task.TaskReporter: Current task already complete, Ignoring all event in heartbeat response, eventCount=500
{noformat}

The issue is that, with NoOpProcessor, Shuffle.shutdown() gets called while 
Fetchers are still being added to the list in RunShuffleCallable. As a result, 
cleanupFetchers() throws a ConcurrentModificationException, which causes thread 
leaks. The stack trace for the ConcurrentModificationException follows.

{noformat}
java.util.ConcurrentModificationException
        at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
        at java.util.ArrayList$Itr.next(ArrayList.java:851)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupFetchers(Shuffle.java:410)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupIgnoreErrors(Shuffle.java:465)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.shutdown(Shuffle.java:329)
        at org.apache.tez.runtime.library.input.OrderedGroupedKVInput.close(OrderedGroupedKVInput.java:186)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:345)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
{noformat}
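
For illustration only, here is a minimal standalone sketch of the failure mode and of one possible way to guard against it. It is not the attached patch and does not use the Tez classes: FetcherCleanupSketch, FakeFetcher, addFetcher and the shutdownRequested flag are all hypothetical names. The sketch assumes the list is a plain ArrayList (as the trace suggests) and simulates the concurrent add on a single thread so that the exception is deterministic.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical stand-in for the fetcher bookkeeping; NOT the Tez Shuffle class.
class FetcherCleanupSketch {

    // Hypothetical fetcher that only records whether it has been shut down.
    public static final class FakeFetcher {
        public boolean stopped = false;
        public void shutDown() { stopped = true; }
    }

    private final List<FakeFetcher> fetchers = new ArrayList<>();
    private boolean shutdownRequested = false;

    // Registers a fetcher unless shutdown has already begun.
    // Returns true if the fetcher was added.
    public synchronized boolean addFetcher(FakeFetcher f) {
        if (shutdownRequested) {
            return false;
        }
        fetchers.add(f);
        return true;
    }

    // Marks shutdown and copies the list under the lock, then stops every
    // fetcher in the copy. Late adds are rejected, so none can be missed.
    public int cleanupFetchers() {
        List<FakeFetcher> snapshot;
        synchronized (this) {
            shutdownRequested = true;
            snapshot = new ArrayList<>(fetchers);
            fetchers.clear();
        }
        for (FakeFetcher f : snapshot) {
            f.shutDown();
        }
        return snapshot.size();
    }

    // Reproduces the failure mode: the list grows while it is being iterated.
    // Returns true if ConcurrentModificationException was thrown.
    public static boolean unsafeCleanupThrows() {
        List<FakeFetcher> list = new ArrayList<>();
        list.add(new FakeFetcher());
        list.add(new FakeFetcher());
        try {
            for (FakeFetcher f : list) {
                f.shutDown();
                list.add(new FakeFetcher()); // simulates an add racing with cleanup
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true; // remaining fetchers were never shut down
        }
    }

    public static void main(String[] args) {
        System.out.println("unsafe cleanup throws: " + unsafeCleanupThrows());
        FetcherCleanupSketch s = new FetcherCleanupSketch();
        s.addFetcher(new FakeFetcher());
        System.out.println("fetchers stopped: " + s.cleanupFetchers());
    }
}
```

In the unsafe variant the iteration aborts after the first element, so any fetchers later in the list are never shut down, which is consistent with the leak described above. The guarded variant trades a short critical section for a consistent snapshot; whether that matches the locking already used in Shuffle is an assumption.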



Attaching a first-cut patch that fixes the fetcher thread leaks. Ran 50K x 50K 
with the patch, and the number of failed tasks has dropped drastically to 42 
(NUM_FAILED_TASKS=42, NUM_SUCCEEDED_TASKS=100000, TOTAL_LAUNCHED_TASKS=100042). 
Will check whether there are merger thread issues as well.

> OOM with a simple scatter gather job with re-use
> ------------------------------------------------
>
>                 Key: TEZ-2186
>                 URL: https://issues.apache.org/jira/browse/TEZ-2186
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Siddharth Seth
>         Attachments: TEZ-2186.1.patch, noopexample.txt
>
>
> With a no-op scatter gather job, 20K x 2K, on a 20 node cluster with 20 2GB 
> containers per node - reducers end up failing with OOM errors. Haven't been 
> able to generate a heap dump yet. Will add details as they're found. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
