[jira] [Commented] (TEZ-14) Support for speculation of slow tasks

Hitesh Shah (JIRA) Mon, 10 Nov 2014 15:36:12 -0800

    [ https://issues.apache.org/jira/browse/TEZ-14?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205562#comment-14205562 ]


Hitesh Shah commented on TEZ-14:
--------------------------------

Comments:

  - VertexEventTaskAttemptStatusUpdate: not sure how this really helps. It has 
a justStarted boolean field, which is a bit of a concern. I am also worried 
about how this event is created, and about the time lag between when it is 
created and when it is actually seen by a consumer. On top of that, there is 
the additional "// TODO send copy of status to prevent race???".

 - VertexEventType.V_TASK_ATTEMPT_STATUS_UPDATE may need to be handled in other 
transitions too, for example TaskCompletedAfterVertexSuccessTransition.

{code}
if (vertex.conf.getBoolean(TezConfiguration.TEZ_AM_SPECULATION_ENABLED,
          TezConfiguration.TEZ_AM_SPECULATION_ENABLED_DEFAULT)) {
{code}
  - change the flag to be a field of VertexImpl? 

 - no docs and tests added for DataStatistics

 - MAX_WAITTING_TIME_FOR_HEARTBEAT is hardcoded instead of being derived as a 
multiple of the configured heartbeat value; likewise for some other properties. 

 - Commented-out code in LegacySpeculator should be removed. 

{code}
+  public long estimatedNewAttemptRuntime(TezTaskID id) {
+    return (long)mapperStatistics.mean();
+  }
{code}
  - why does this need a taskId param? 

{code}
+        float progress = taskAttempt.getProgress();
{code}
  - today, the inputs and outputs are not wired to report progress. This may 
create a problem where the reducer fetch will trigger speculations.

{code}
+    // If we are here, there's at most one task attempt.
+    if (numberRunningAttempts == 0) {
+      return NOT_RUNNING;
+    }
+
{code}
  - couldn't this just rely on the task state, which goes into running only 
after the first attempt heartbeats back? 

 - is an interface for TaskRuntimeEstimator needed? 

 - most new classes could use some overall docs
 - additional docs in the speculator to explain logic would help
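
To make the two configuration comments above concrete (the speculation flag as a field, and the heartbeat wait derived from the configured interval), here is a rough sketch. Everything in it is hypothetical: the class name, the config keys, the defaults, and the multiplier of 9 are made up for illustration and are not taken from the patch. A plain Map stands in for the real configuration object.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: read speculation settings once at construction time
// instead of calling conf.getBoolean(...) inside every transition.
final class SpeculationSettingsSketch {
  // Assumed keys and defaults, for illustration only (not real Tez keys).
  static final String SPECULATION_ENABLED_KEY = "sketch.speculation.enabled";
  static final String HEARTBEAT_INTERVAL_MS_KEY = "sketch.heartbeat.interval.ms";
  static final long DEFAULT_HEARTBEAT_INTERVAL_MS = 1000L;
  // Assumed number of heartbeat intervals to wait before giving up.
  static final int MISSED_HEARTBEATS_ALLOWED = 9;

  final boolean speculationEnabled;
  final long maxWaitForHeartbeatMs;

  SpeculationSettingsSketch(Map<String, String> conf) {
    // The flag is parsed once and cached in a final field.
    this.speculationEnabled =
        Boolean.parseBoolean(conf.getOrDefault(SPECULATION_ENABLED_KEY, "false"));
    long heartbeatMs = Long.parseLong(
        conf.getOrDefault(HEARTBEAT_INTERVAL_MS_KEY,
            Long.toString(DEFAULT_HEARTBEAT_INTERVAL_MS)));
    // The wait is a multiple of the configured heartbeat, not a hardcoded value.
    this.maxWaitForHeartbeatMs = heartbeatMs * MISSED_HEARTBEATS_ALLOWED;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(SPECULATION_ENABLED_KEY, "true");
    conf.put(HEARTBEAT_INTERVAL_MS_KEY, "200");
    SpeculationSettingsSketch settings = new SpeculationSettingsSketch(conf);
    System.out.println(settings.speculationEnabled + " " + settings.maxWaitForHeartbeatMs);
  }
}
```

With this shape, a change to the heartbeat interval carries over to the wait time automatically, and transitions only test a boolean field.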

Other general concerns:
   - how to disable sending attempt status events if speculation is disabled
   - The status update may be read very late by the speculator if there is an 
event backlog in the AM - should the speculator just query the task for its 
state instead of relying on the event? 
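
As a rough illustration of the "query the task" alternative, the sketch below has the speculator scan current task state on its own schedule, so it never consumes status events and an event backlog cannot delay it. All names here (PollingSpeculatorSketch, TaskSnapshot, TaskSource) are hypothetical and not part of the patch. The remaining-time estimate is a simple linear extrapolation chosen for the example, and skipping tasks that report zero progress is an assumption of this sketch, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a speculator that polls task state directly.
final class PollingSpeculatorSketch {
  enum TaskState { SCHEDULED, RUNNING, SUCCEEDED }

  // Immutable view of one task at the moment it is queried.
  static final class TaskSnapshot {
    final String id;
    final TaskState state;
    final float progress;   // 0.0 to 1.0
    final long startTimeMs;

    TaskSnapshot(String id, TaskState state, float progress, long startTimeMs) {
      this.id = id;
      this.state = state;
      this.progress = progress;
      this.startTimeMs = startTimeMs;
    }
  }

  // Stand-in for whatever the speculator would ask for current task state.
  interface TaskSource {
    List<TaskSnapshot> currentTasks();
  }

  // Returns ids of running tasks whose extrapolated remaining time exceeds
  // the estimated runtime of a fresh attempt.
  static List<String> findSpeculationCandidates(
      TaskSource source, long nowMs, long newAttemptRuntimeMs) {
    List<String> candidates = new ArrayList<>();
    for (TaskSnapshot t : source.currentTasks()) {
      // Only running tasks qualify; zero progress is skipped (assumption).
      if (t.state != TaskState.RUNNING || t.progress <= 0f) {
        continue;
      }
      long elapsedMs = nowMs - t.startTimeMs;
      long estimatedTotalMs = (long) (elapsedMs / (double) t.progress);
      long remainingMs = estimatedTotalMs - elapsedMs;
      if (remainingMs > newAttemptRuntimeMs) {
        candidates.add(t.id);
      }
    }
    return candidates;
  }
}
```

A periodic thread in the AM could call findSpeculationCandidates with the current time and the estimated new-attempt runtime. The trade-off is that polling reads task state from another thread, so the real task objects would need to expose that state safely.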
  






> Support for speculation of slow tasks
> -------------------------------------
>
>                 Key: TEZ-14
>                 URL: https://issues.apache.org/jira/browse/TEZ-14
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-14.1.patch, TEZ-14.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
