[ https://issues.apache.org/jira/browse/TEZ-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Balamohan updated TEZ-2186:
----------------------------------
Attachment: TEZ-2186.1.patch
Reproduced the issue with a 50K x 50K job using 2 GB containers.
{noformat}
2015-03-10 03:20:35,369 ERROR [TezChild] task.TezTaskRunner: Exception of type Error.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.configureAndStart(MergeManager.java:288)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.run(Shuffle.java:313)
at org.apache.tez.runtime.library.input.OrderedGroupedKVInput.start(OrderedGroupedKVInput.java:125)
at org.apache.tez.runtime.library.processor.SimpleProcessor.preOp(SimpleProcessor.java:73)
at org.apache.tez.runtime.library.processor.SimpleProcessor.run(SimpleProcessor.java:52)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:177)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:173)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2015-03-10 03:20:35,431 WARN [TaskHeartbeatThread] task.TaskReporter: Current task already complete, Ignoring all event in heartbeat response, eventCount=500
{noformat}
The problem is that, with NoOpProcessor, Shuffle.shutdown() is called while
Fetchers are still being added to the list in RunShuffleCallable. As a result,
cleanupFetchers() throws a ConcurrentModificationException, which aborts the
cleanup and leaks fetcher threads. The stack trace for the
ConcurrentModificationException follows.
{noformat}
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupFetchers(Shuffle.java:410)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupIgnoreErrors(Shuffle.java:465)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.shutdown(Shuffle.java:329)
at org.apache.tez.runtime.library.input.OrderedGroupedKVInput.close(OrderedGroupedKVInput.java:186)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:345)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
{noformat}
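For illustration, here is a minimal sketch of the race described above and
one common way to guard against it. It is not the content of
TEZ-2186.1.patch, and every name in it (FetcherCleanupSketch, FakeFetcher,
GuardedFetchers) is hypothetical rather than taken from the Tez code base. The
unsafe method reproduces the exception deterministically by adding to an
ArrayList while iterating it, which is what effectively happens when one
thread is still adding fetchers and another has started cleanup. The guarded
variant assumes a shared lock and a shutdown flag are acceptable.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

class FetcherCleanupSketch {

    /** Hypothetical stand-in for a fetcher thread; not the Tez Fetcher class. */
    static final class FakeFetcher {
        volatile boolean stopped;
        void shutDown() { stopped = true; }
    }

    /** Unsafe pattern: the list is structurally modified while it is iterated. */
    static boolean unsafeCleanupThrows() {
        List<FakeFetcher> fetchers = new ArrayList<>();
        fetchers.add(new FakeFetcher());
        fetchers.add(new FakeFetcher());
        try {
            for (FakeFetcher f : fetchers) {
                f.shutDown();
                // Simulates fetchers still being added while cleanup runs.
                fetchers.add(new FakeFetcher());
            }
            return false;
        } catch (ConcurrentModificationException e) {
            // Cleanup aborted early; the remaining fetchers were never stopped.
            return true;
        }
    }

    /** Guarded pattern: adds and cleanup share a lock and a shutdown flag. */
    static final class GuardedFetchers {
        private final Object lock = new Object();
        private final List<FakeFetcher> fetchers = new ArrayList<>();
        private boolean shutdown;

        /** Returns false, and stops the fetcher, if shutdown has already begun. */
        boolean add(FakeFetcher f) {
            synchronized (lock) {
                if (shutdown) {
                    f.shutDown();
                    return false;
                }
                fetchers.add(f);
                return true;
            }
        }

        /** Stops every registered fetcher and returns how many were stopped. */
        int cleanup() {
            List<FakeFetcher> snapshot;
            synchronized (lock) {
                shutdown = true;
                snapshot = new ArrayList<>(fetchers);
                fetchers.clear();
            }
            // Iterate over the private snapshot, outside the lock.
            for (FakeFetcher f : snapshot) {
                f.shutDown();
            }
            return snapshot.size();
        }
    }

    public static void main(String[] args) {
        System.out.println("unsafe cleanup threw: " + unsafeCleanupThrows());
        GuardedFetchers guarded = new GuardedFetchers();
        guarded.add(new FakeFetcher());
        System.out.println("guarded cleanup stopped: " + guarded.cleanup());
    }
}
```

Taking a snapshot under the lock keeps the iteration away from concurrent
adds, and the shutdown flag ensures a fetcher created after cleanup has begun
is stopped immediately instead of being left running.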
Attaching a first-cut patch that fixes the fetcher thread issue. Ran 50K x
50K with the patch, and it fixes the fetcher thread leaks. The number of failed
tasks has dropped drastically to 42 (NUM_FAILED_TASKS=42,
NUM_SUCCEEDED_TASKS=100000, TOTAL_LAUNCHED_TASKS=100042). Will check whether
there are merger thread issues as well.
> OOM with a simple scatter gather job with re-use
> ------------------------------------------------
>
> Key: TEZ-2186
> URL: https://issues.apache.org/jira/browse/TEZ-2186
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Siddharth Seth
> Attachments: TEZ-2186.1.patch, noopexample.txt
>
>
> With a no-op scatter gather job, 20K x 2K, on a 20 node cluster with 20 2GB
> containers per node - reducers end up failing with OOM errors. Haven't been
> able to generate a heap dump yet. Will add details as they're found.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)