[ 
https://issues.apache.org/jira/browse/TEZ-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-3700:
----------------------------------
    Attachment: TEZ-3700.3.patch

When debugging, I encountered a deadlock in ShuffleScheduler. This is very much 
a corner case and is not easily reproducible. Attaching a snippet of the stack 
trace here:

{noformat}

"Fetcher_O {Map_1} #3" #6732 daemon prio=5 os_prio=0 tid=0x00007f9ec1a70000 nid=0x5a66 waiting for monitor entry [0x00007fabb33f2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copySucceeded(ShuffleScheduler.java:566)
        - waiting to lock <0x00007fb034000140> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:517)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:272)
…
…

"TezTaskEventRouter{attempt_1490656001509_1186_1_01_000286_0}" #6728 daemon prio=5 os_prio=0 tid=0x00007f9efa18c800 nid=0x5a62 in Object.wait() [0x00007fabdd051000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1245)
        - locked <0x00007fafc4000550> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$Referee)
        at java.lang.Thread.join(Thread.java:1319)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.close(ShuffleScheduler.java:471)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleScheduler(Shuffle.java:350)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleSchedulerIgnoreErrors(Shuffle.java:341)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.reportException(Shuffle.java:405)
        - locked <0x00007fae400001b0> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.obsoleteInput(ShuffleScheduler.java:1081)
        - locked <0x00007fb034000140> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleInputEventHandlerOrderedGrouped.processTaskFailedEvent(ShuffleInputEventHandlerOrderedGrouped.java:169)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleInputEventHandlerOrderedGrouped.handleEvent(ShuffleInputEventHandlerOrderedGrouped.java:127)

….
…

"ShufflePenaltyReferee {Map_1}" #6726 daemon prio=5 os_prio=0 tid=0x00007f9efb9a7000 nid=0x5a5d waiting for monitor entry [0x00007fabc0dcc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$Referee.run(ShuffleScheduler.java:1295)
        - waiting to lock <0x00007fb034000140> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
{noformat}

This is due to a corner case in how obsolete inputs are handled. {{public void 
obsoleteInput}} takes a lock on {{ShuffleScheduler}}. While holding that lock, 
it tries to shut down the {{Referee}} thread (the {{Thread.join}} in the trace 
above), and that thread in turn tries to acquire the lock on 
{{ShuffleScheduler}}. Fixed this in the current patch itself.
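
For illustration, the lock-then-join pattern can be reproduced in a few lines. This is only a minimal sketch and not the Tez code or the patch: {{RefereeJoinSketch}}, its nested {{Scheduler}}, and the method names are hypothetical, and a timed {{join}} is used so the demo terminates instead of hanging the way a plain {{join()}} would.

```java
import java.util.concurrent.CountDownLatch;

public class RefereeJoinSketch {

    /** Hypothetical stand-in for a scheduler that owns a helper thread. */
    static class Scheduler {
        private final CountDownLatch closing = new CountDownLatch(1);
        private final Thread referee;

        Scheduler() {
            referee = new Thread(() -> {
                try {
                    closing.await(); // wait until a close is under way
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (Scheduler.this) {
                    // Stand-in for work that needs the scheduler's monitor.
                }
            }, "referee");
        }

        void start() {
            referee.start();
        }

        /** Problematic shape: joins the helper while still owning the monitor. */
        synchronized boolean closeHoldingLock(long timeoutMs) {
            closing.countDown();
            return joinReferee(timeoutMs);
        }

        /** Safer shape: touch shared state under the lock, then join outside it. */
        boolean closeAfterReleasingLock(long timeoutMs) {
            synchronized (this) {
                closing.countDown();
            }
            return joinReferee(timeoutMs);
        }

        /** Returns true if the helper thread finished within the timeout. */
        private boolean joinReferee(long timeoutMs) {
            try {
                referee.join(timeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return !referee.isAlive();
        }
    }

    public static void main(String[] args) {
        Scheduler broken = new Scheduler();
        broken.start();
        System.out.println("join while holding lock finished: "
            + broken.closeHoldingLock(200));

        Scheduler fixed = new Scheduler();
        fixed.start();
        System.out.println("join after releasing lock finished: "
            + fixed.closeAfterReleasingLock(5000));
    }
}
```

In the first variant the helper cannot enter its {{synchronized}} block until the caller leaves the method, so the join can only time out. In the second variant the monitor is released before the join, which is the same idea as dropping the method-level lock around the shutdown path.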

Changes:
1. Added {{ShuffleEventInfo::scheduledForDownload}} to indicate whether 
fetchers have been assigned events for downloading data.
2. Instead of failing the consumer on validation, the consumer now uses 
{{killSelf}} so that the job does not fail. 
3. Same logic applied in {{obsoleteInput}} as well.
4. Removed method level {{synchronized}} in {{obsoleteInput}}.
5. In case of a failed task attempt, the vertex does not propagate any future 
events downstream. 
6. Modified test cases.

[~sseth] - Can you please review when you find time?

> Consumer attempt should kill itself instead of failing during validation 
> checks with final merge avoidance
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-3700
>                 URL: https://issues.apache.org/jira/browse/TEZ-3700
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-3700.1.patch, TEZ-3700.2.patch, TEZ-3700.3.patch
>
>
> Currently, if data is received from different attempts with final merge 
> disabled (with/without pipelining), the consumer attempt ends up with a 
> failure. Instead, it should issue a kill request so that the job does not end 
> up with failures.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
