[ 
https://issues.apache.org/jira/browse/TEZ-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-2251:
----------------------------------
    Attachment: TEZ-2251.VertexImpl.readlock.patch

Update:
=======
Tried adding readlocks in VertexImpl.getInputSpecList & getInputSpecList. 
The issue still remains.  Dispatcher thread should have ideally blocked on 
getInputSpecList(). Not sure how it ended up seeing the wrong value "6" instead 
of "46".  

{noformat}
 2015-04-01 20:27:46,362 INFO [App Shared Pool - #6] 
vertexmanager.ShuffleVertexManager: Scheduling 46 tasks for vertex: Reducer 6 
with totalTasks: 46. 52 source tasks completed out of 64. 
SourceTaskCompletedFraction: 0.8125 min: 0.25 max: 0.75
2015-04-01 20:27:46,362 INFO [App Shared Pool - #6] 
vertexmanager.ShuffleVertexManager: Reduce auto parallelism for vertex: Reducer 
6 to 6 from 46 . Expected output: 28669934 based on actual output: 23294322 
from 52 vertex manager events.  desiredTaskInputSize: 67108864 max slow start 
tasks:48.0 num sources completed:52
2015-04-01 20:27:46,362 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1424502260528_2022_1][Event:TASK_STARTED]: vertexName=Reducer 
5, taskId=task_1424502260528_2022_1_09_000000, scheduledTime=1427945266362, 
launchTime=1427945266362
2015-04-01 20:27:46,362 INFO [HistoryEventHandlingThread] 
impl.SimpleHistoryLoggingService: Writing event TASK_STARTED to history file
2015-04-01 20:27:46,362 INFO [App Shared Pool - #6] impl.VertexImpl: Resetting 
vertex location hints due to change in parallelism for vertex: 
vertex_1424502260528_2022_1_10 [Reducer 6]
2015-04-01 20:27:46,362 INFO [Dispatcher thread: Central] impl.TaskImpl: 
task_1424502260528_2022_1_09_000000 Task Transitioned from NEW to SCHEDULED due 
to event T_SCHEDULE
2015-04-01 20:27:46,363 INFO [App Shared Pool - #6] impl.VertexImpl: Removing 
task: task_1424502260528_2022_1_10_000006
....
....
2015-04-01 20:27:46,363 INFO [App Shared Pool - #6] impl.VertexImpl: Vertex 
vertex_1424502260528_2022_1_10 [Reducer 6] parallelism set to 6 from 46
2015-04-01 20:27:46,363 INFO [App Shared Pool - #6] impl.VertexImpl: Replacing 
edge manager for source:Reducer 5 destination: vertex_1424502260528_2022_1_10 
[Reducer 6]
2015-04-01 20:27:46,363 INFO [App Shared Pool - #6] impl.VertexImpl: Replacing 
edge manager for source:Map 11 destination: vertex_1424502260528_2022_1_10 
[Reducer 6]
2015-04-01 20:27:46,363 INFO [HistoryEventHandlingThread] 
impl.SimpleHistoryLoggingService: Writing event VERTEX_PARALLELISM_UPDATED to 
history file
2015-04-01 20:27:46,363 INFO [App Shared Pool - #6] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1424502260528_2022_1][Event:VERTEX_PARALLELISM_UPDATED]: 
vertexId=vertex_1424502260528_2022_1_10, numTasks=6, vertexLocationHint=null, 
edgeManagersCount=2, rootInputSpecUpdateCount=null
2015-04-01 20:27:46,363 INFO [Dispatcher thread: Central] impl.TaskAttemptImpl: 
remoteTaskSpec:DAGName : 
rajesh_20150401202556_46121d51-f387-461f-a8e9-52028134010c:1, VertexName: 
Reducer 5, VertexParallelism: 1, 
TaskAttemptID:attempt_1424502260528_2022_1_09_000000_0, 
processorName=org.apache.hadoop.hive.ql.exec.tez.ReduceTezProcessor, 
inputSpecListSize=1, outputSpecListSize=1, inputSpecList=[{{ 
sourceVertexName=Reducer 4, physicalEdgeCount=4044, 
inputClassName=org.apache.tez.runtime.library.input.OrderedGroupedKVInput }}, 
], outputSpecList=[{{ destinationVertexName=Reducer 6, physicalEdgeCount=6, 
outputClassName=org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput
 }}, ]
{noformat}

Another issue is, if I add any debug/info logs, job finishes fine as it sort of 
introduces a delay in the processing.

> Enabling auto reduce parallelism in certain jobs causes DAG to hang
> -------------------------------------------------------------------
>
>                 Key: TEZ-2251
>                 URL: https://issues.apache.org/jira/browse/TEZ-2251
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: TEZ-2251.VertexImpl.patch, 
> TEZ-2251.VertexImpl.readlock.patch, TEZ-2251.fix_but_slows_down.patch, 
> hive_console.png, tez-2251.vertexpatch.am.log.gz, tez_2251_dag.png
>
>
> Scenario:
> - Run TPCH query20 
> (https://github.com/cartershanklin/hive-testbench/blob/master/sample-queries-tpch/tpch_query20.sql)
>  at 1 TB scale (tez-master branch, hive trunk)
> - Enable auto reduce parallelism
> - DAG didn't complete and got stuck in "Reducer 6"
> Vertex parallelism of "Reducer 5 & 6" happens within a span of 3 
> milliseconds, and tasks of "reducer 5" ends up producing wrong partition 
> details as it sees the updated task numbers of reducer 6 when scheduled.  
> This causes, job to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to