[ 
https://issues.apache.org/jira/browse/TEZ-14?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207618#comment-14207618
 ] 

Bikas Saha commented on TEZ-14:
-------------------------------

Addressed the main comments from both reviews.

I am going to keep using an event to send the update, because updates can 
trigger speculation, and taking write locks and making direct invocations from 
TaskAttempt/Task/Vertex or vice versa is not going to be safe. Without events, 
the speculator would also have to poll all running attempts at regular 
intervals etc. If the queue on the AM is backed up then speculation is probably 
not the first thing to worry about. Even before the update event, the event 
from the heartbeat handler to the attempt itself will be stuck. The code is 
written to be safe against separating the dispatchers for 
vertices/tasks/attempts, so when we make that change we should be ok. Since 
this update is essentially per attempt and not per task, it seems correct to 
send it from the attempt instead of going via the task.
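
For illustration only, the event-based approach can be sketched roughly as 
below. This is not code from the patch: AttemptStatusUpdate, SpeculatorSketch 
and their methods are hypothetical names, and a plain queue stands in for the 
AM dispatcher. The point is that the attempt only enqueues an update and never 
calls into the speculator while holding its own lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical event type; not the Tez event class.
final class AttemptStatusUpdate {
    final String attemptId;
    final float progress;

    AttemptStatusUpdate(String attemptId, float progress) {
        this.attemptId = attemptId;
        this.progress = progress;
    }
}

// Hypothetical speculator that consumes updates from a queue.
final class SpeculatorSketch {
    private final BlockingQueue<AttemptStatusUpdate> queue = new LinkedBlockingQueue<>();
    private final Map<String, Float> latestProgress = new HashMap<>();

    // Called from the attempt: enqueue only, no direct invocation.
    void send(AttemptStatusUpdate update) {
        queue.add(update);
    }

    // Called on the consuming thread: apply every queued update in order.
    int drain() {
        int handled = 0;
        AttemptStatusUpdate update;
        while ((update = queue.poll()) != null) {
            latestProgress.put(update.attemptId, update.progress);
            handled++;
        }
        return handled;
    }

    // Latest progress seen for an attempt, or null if none was processed yet.
    Float progressOf(String attemptId) {
        return latestProgress.get(attemptId);
    }
}
```

In this sketch a backed-up queue only delays the speculator's view of progress; 
it does not block the attempt that sent the update.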

It looks like the speculation/estimation code works even with progress=0, 
because in that case the calculations end up looking at completed runtimes. 
Added test cases for both progress updates and progress=0. Both work, but we 
will likely have to look at real executions on the cluster to see if this has 
any other issues.
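
As a rough illustration of why progress=0 can still yield an estimate, here is 
a minimal sketch. It is not the ported estimator: RuntimeEstimateSketch and its 
methods are hypothetical, and the use of a simple mean over completed runtimes 
is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical estimator; not the class ported from MR.
final class RuntimeEstimateSketch {
    private final List<Long> completedRuntimesMs = new ArrayList<>();

    // Record the runtime of an attempt that finished successfully.
    void recordCompleted(long runtimeMs) {
        completedRuntimesMs.add(runtimeMs);
    }

    // Mean runtime of completed attempts, or -1 if none have completed.
    double meanCompletedMs() {
        if (completedRuntimesMs.isEmpty()) {
            return -1.0;
        }
        long sum = 0L;
        for (long runtime : completedRuntimesMs) {
            sum += runtime;
        }
        return (double) sum / completedRuntimesMs.size();
    }

    // Estimated total runtime of a running attempt, or -1 if unknown.
    double estimateTotalMs(long elapsedMs, float progress) {
        if (progress > 0f) {
            // Extrapolate from the progress reported so far.
            return elapsedMs / (double) progress;
        }
        // No progress reported: fall back to completed runtimes.
        return meanCompletedMs();
    }
}
```

With no progress and no completed attempts the sketch returns -1, meaning there 
is nothing to base a speculation decision on yet.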

KillTransition was edited because it is legal for a leaf vertex task attempt to 
be killed after success but illegal for it to be failed after success since 
read errors cannot be reported for it.
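
The rule can be written down as a small check. This is only a sketch with 
hypothetical names (PostSuccessRuleSketch, Event), not the actual 
KillTransition code. Treating both events as legal for non-leaf vertices is an 
assumption, based on their output being readable by downstream tasks.

```java
// Hypothetical rule check; not the Tez state machine.
final class PostSuccessRuleSketch {
    enum Event { KILL, FAIL }

    // Is this event legal for an attempt that has already succeeded?
    static boolean isLegalAfterSuccess(boolean leafVertex, Event event) {
        if (!leafVertex) {
            // Assumption: downstream readers can report read errors.
            return true;
        }
        // Leaf vertex: nothing reads its output, so only a kill is legal.
        return event == Event.KILL;
    }
}
```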

DAGImpl is returning a TaskImpl. So it should be fine since they both are in 
Impl land and not the interface land.

The port tries to keep changes to the ported code minimal so that we can track 
and compare against MR while debugging issues etc., so I am keeping the code 
structure. I am also not making the values configurable, since they aren't 
configurable in MR and are probably tuned. If needed we can make them 
configurable later on.

Added more tests for TaskAttempt, and the DataStatistics test is ported over.

> Support for speculation of slow tasks
> -------------------------------------
>
>                 Key: TEZ-14
>                 URL: https://issues.apache.org/jira/browse/TEZ-14
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-14.1.patch, TEZ-14.2.patch, TEZ-14.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
