[
https://issues.apache.org/jira/browse/TEZ-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-3700:
----------------------------------
Attachment: TEZ-3700.3.patch
When debugging, encountered a deadlock in ShuffleScheduler. This is a rare
corner case and is not easily reproducible. Attaching the snippet of the stack
trace here.
{noformat}
"Fetcher_O {Map_1} #3" #6732 daemon prio=5 os_prio=0 tid=0x00007f9ec1a70000 nid=0x5a66 waiting for monitor entry [0x00007fabb33f2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copySucceeded(ShuffleScheduler.java:566)
        - waiting to lock <0x00007fb034000140> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:517)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:272)
…
…
"TezTaskEventRouter{attempt_1490656001509_1186_1_01_000286_0}" #6728 daemon prio=5 os_prio=0 tid=0x00007f9efa18c800 nid=0x5a62 in Object.wait() [0x00007fabdd051000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1245)
        - locked <0x00007fafc4000550> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$Referee)
        at java.lang.Thread.join(Thread.java:1319)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.close(ShuffleScheduler.java:471)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleScheduler(Shuffle.java:350)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleSchedulerIgnoreErrors(Shuffle.java:341)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.reportException(Shuffle.java:405)
        - locked <0x00007fae400001b0> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.obsoleteInput(ShuffleScheduler.java:1081)
        - locked <0x00007fb034000140> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleInputEventHandlerOrderedGrouped.processTaskFailedEvent(ShuffleInputEventHandlerOrderedGrouped.java:169)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleInputEventHandlerOrderedGrouped.handleEvent(ShuffleInputEventHandlerOrderedGrouped.java:127)
….
…
"ShufflePenaltyReferee {Map_1}" #6726 daemon prio=5 os_prio=0 tid=0x00007f9efb9a7000 nid=0x5a5d waiting for monitor entry [0x00007fabc0dcc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$Referee.run(ShuffleScheduler.java:1295)
        - waiting to lock <0x00007fb034000140> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
{noformat}
This is due to a corner case in how obsolete inputs are handled. {{public void
obsoleteInput}} takes a lock on {{ShuffleScheduler}}. While holding that lock,
it tries to shut down the {{Referee}} thread, which is itself waiting to
acquire the {{ShuffleScheduler}} lock. Fixed this in the current patch itself.
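The locking pattern behind the trace can be reproduced in isolation. Below is a minimal sketch, not the Tez code: the class {{JoinUnderLockSketch}}, the nested {{Scheduler}}, and both {{close*}} methods are hypothetical names chosen for illustration. It uses a join timeout so the demo terminates; the {{Thread.join()}} in the trace has no timeout, so there the wait never ends.

```java
public class JoinUnderLockSketch {

    /** Hypothetical scheduler; only the locking pattern mirrors the trace. */
    static class Scheduler {
        private int penalties = 0; // state guarded by this object's monitor

        /** Worker that needs the scheduler's monitor before it can finish. */
        private Thread newReferee() {
            Thread t = new Thread(() -> {
                synchronized (Scheduler.this) {
                    penalties++;
                }
            }, "referee-sketch");
            t.setDaemon(true);
            return t;
        }

        /** Deadlock-prone: joins the worker while still holding the monitor. */
        synchronized boolean closeHoldingLock(long timeoutMs)
                throws InterruptedException {
            Thread referee = newReferee();
            referee.start();
            // The worker is blocked on the monitor held by this method,
            // so the join can only end by timing out.
            referee.join(timeoutMs);
            return !referee.isAlive();
        }

        /** Safer: touch shared state under the lock, join after releasing it. */
        boolean closeAfterReleasingLock(long timeoutMs)
                throws InterruptedException {
            Thread referee;
            synchronized (this) {
                referee = newReferee();
                referee.start();
            }
            // Monitor is released here, so the worker can acquire it and exit.
            referee.join(timeoutMs);
            return !referee.isAlive();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Scheduler s = new Scheduler();
        System.out.println("join while holding lock finished: "
                + s.closeHoldingLock(200));
        System.out.println("join after releasing lock finished: "
                + s.closeAfterReleasingLock(2000));
    }
}
```

The first call reports {{false}} because the worker cannot make progress until the caller leaves the synchronized method. The second reports {{true}}, which is the general idea behind dropping the method-level lock around the shutdown path.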
Changes:
1. Added {{ShuffleEventInfo::scheduledForDownload}} to indicate whether
fetchers have been assigned events for downloading data.
2. Instead of failing the consumer on validation, it uses {{killSelf}} to keep
the job from failing.
3. Same logic applied in {{obsoleteInput}} as well.
4. Removed method level {{synchronized}} in {{obsoleteInput}}.
5. In case of a failed task attempt, the vertex does not propagate any future
events downstream.
6. Modified test cases.
[~sseth] - Can you please review when you find time?
> Consumer attempt should kill itself instead of failing during validation
> checks with final merge avoidance
> ----------------------------------------------------------------------------------------------------------
>
> Key: TEZ-3700
> URL: https://issues.apache.org/jira/browse/TEZ-3700
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-3700.1.patch, TEZ-3700.2.patch, TEZ-3700.3.patch
>
>
> Currently, if data is received from different attempts with final merge
> disabled (with/without pipelining), the consumer attempt ends up with a
> failure. Instead, it should issue a kill request so that the job does not end
> up with failures.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)