[jira] [Updated] (TEZ-2943) Change shuffle vertex manager to use per vertex data for auto reduce and slow start

Bikas Saha (JIRA) Fri, 11 Dec 2015 19:01:04 -0800

     [ 
https://issues.apache.org/jira/browse/TEZ-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bikas Saha updated TEZ-2943:
----------------------------
    Fix Version/s: 0.7.1

> Change shuffle vertex manager to use per vertex data for auto reduce and slow 
> start
> -----------------------------------------------------------------------------------
>
>                 Key: TEZ-2943
>                 URL: https://issues.apache.org/jira/browse/TEZ-2943
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Bikas Saha
>             Fix For: 0.7.1, 0.8.2
>
>         Attachments: TEZ-2943.1.patch, TEZ-2943.2.patch, TEZ-2943.3.patch
>
>
> Excerpts from a DAG execution log highlight the issue.
> A DAG has a vertex scope-212 that has two input sources scope-210 and 
> scope-211. The input properties have the following data movement properties.
> {noformat}
> scope-211: 
> -Tasks count: 72
> -OUTPUT_BYTES:93,975,296
> scope-210
> -Tasks count: 5
> -OUTPUT_BYTES: 2,315,364,586
> {noformat}
> Here is scope-212 auto reduce parallelism kicking in
> {code}
> 2015-11-11 19:46:28,829 [INFO] [App Shared Pool - #0] 
> |vertexmanager.ShuffleVertexManager|: Reduce auto parallelism for vertex: 
> scope-212 to 1 from 56 . Expected output: 101293660 based on actual output: 
> 76299121 from 58 vertex manager events.  desiredTaskInputSize: 134217728 max 
> slow start tasks:57.75 num sources completed:58
> {code}
> Some more background on why we determined the auto parallelism and started 
> scheduling tasks
> {code}
> 2015-11-11 19:46:28,829 [INFO] [App Shared Pool - #0] 
> |vertexmanager.ShuffleVertexManager|: Scheduling 56 tasks for vertex: 
> scope-212 with totalTasks: 56. 58 source tasks completed out of 77. 
> SourceTaskCompletedFraction: 0.7532467 min: 0.25 max: 0.75
> {code}
> It made this decision since 58/77 is roughly 75%. Most of the data came from 
> scope-210. let's see how many of the 58 source tasks completed are from 
> scope-210
> This is the first task attempt from scope-210
> {code}
> 2015-11-11 19:47:02,862 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1446259040496_444992_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=scope-210, 
> taskAttemptId=attempt_1446259040496_444992_1_02_000004_0, 
> creationTime=1447271172510, allocationTime=1447271175440, 
> startTime=1447271181036, finishTime=1447271222861, timeTaken=41825, 
> status=SUCCEEDED, errorEnum=, diagnostics=
> {code}
> This is almost 30 seconds after the auto-reduce parallelism kicked in. So it 
> seems we have a serious auto-reduce parallelism bug since it 1) doesn't 
> require 75% percent of each source vertex and/or 2) it doesn't weight the 
> expected bytes per vertex.
> This means that it launched only 1 task for 2.5G of data instead of 16
> This may relate to TEZ-1532 since events do not signify which source vertex 
> or task they originate from.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (TEZ-2943) Change shuffle vertex manager to use per vertex data for auto reduce and slow start

Reply via email to