[
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978739#comment-16978739
]
Ahmed Hussein commented on TEZ-4067:
------------------------------------
[~jeagles], I tried to refresh my memory a little bit. There was a check on the
service state to prevent starting the service more than once.
The workflow of the {{DAGAppMaster}} is as follows; correct me if I am
wrong:
* {{DAGAppMaster}} is created
* Services get initialized. This is the phase when the services are added to
the "{{DAGAppMaster.services}}" map.
* All the services are started inside {{serviceStart.startServices()}}. Note
that the {{DAG}} is not created yet.
* {{startDag()}} and {{startDagExecution}} finally create the DAG
"{{currentDAG}}" and its vertices.
This workflow requires that speculators be initialized and started separately
after the DAG is created. Although we can still add them to the services map,
we cannot assume that they will start automatically in
{{DAGAppMaster.serviceStart()}}.
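The ordering above can be illustrated with a minimal sketch. It does not use the Tez or Hadoop classes: {{SketchService}}, {{SketchAppMaster}} and their methods are hypothetical stand-ins, and the state check is only an assumption about how a start-once guard could look.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a service with a guarded start; not the Hadoop or Tez API. */
class SketchService {
  enum State { NOTINITED, INITED, STARTED, STOPPED }

  private State state = State.NOTINITED;
  private int startCount = 0;

  synchronized void init() {
    if (state == State.NOTINITED) {
      state = State.INITED;
    }
  }

  /** Starts the service at most once; calls made in any other state are ignored. */
  synchronized void start() {
    if (state == State.INITED) {
      state = State.STARTED;
      startCount++;
    }
  }

  synchronized State getState() {
    return state;
  }

  synchronized int getStartCount() {
    return startCount;
  }
}

/** Hypothetical application master that mirrors the workflow described above. */
class SketchAppMaster {
  final List<SketchService> services = new ArrayList<>();
  final List<SketchService> speculators = new ArrayList<>();

  /** Phase 2: services are registered and initialized. */
  void serviceInit() {
    services.add(new SketchService());
    for (SketchService s : services) {
      s.init();
    }
  }

  /** Phase 3: registered services are started; no DAG (and no speculator) exists yet. */
  void serviceStart() {
    for (SketchService s : services) {
      s.start();
    }
  }

  /** Phase 4: the DAG is created, so speculators are initialized and started here. */
  void startDag(int vertexCount) {
    for (int i = 0; i < vertexCount; i++) {
      SketchService speculator = new SketchService();
      speculator.init();
      speculator.start();
      speculators.add(speculator);
    }
  }
}
```

In this sketch, no speculator exists when {{serviceStart()}} runs, so starting them has to happen explicitly in {{startDag()}}; a second {{start()}} call on the same speculator is a no-op because of the state check.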
The same applies to {{DAGAppMaster.serviceStop()}}, which is called at the end of
the execution. Therefore, a service in the "{{DAGAppMaster.services}}" map will
stay around until the whole DAG is completed. Since a vertex can complete
earlier, the speculator service related to that vertex will hang around until
the {{DAGAppMaster}} is completed.
If we add the speculators to "{{DAGAppMaster.services}}", we won't be able to
remove the service when a vertex is completed, since a {{Vertex/DAGImpl}} does
not have access to the "{{DAGAppMaster.services}}".
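As a rough illustration of the alternative, the sketch below lets each vertex own its speculator and stop it on completion. {{SketchSpeculator}} and {{SketchVertex}} are hypothetical names, not the classes in the patch, and the real lifecycle may differ.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical speculator with a minimal start/stop lifecycle. */
class SketchSpeculator {
  private final AtomicBoolean running = new AtomicBoolean(false);

  void start() {
    running.set(true);
  }

  void stop() {
    running.set(false);
  }

  boolean isRunning() {
    return running.get();
  }
}

/** Hypothetical vertex that owns its speculator instead of registering it with the AM. */
class SketchVertex {
  private final String name;
  private final SketchSpeculator speculator = new SketchSpeculator();

  SketchVertex(String name) {
    this.name = name;
  }

  /** Called once the DAG and its vertices exist. */
  void start() {
    speculator.start();
  }

  /** The vertex holds the reference, so it can stop its own speculator on completion. */
  void complete() {
    speculator.stop();
  }

  boolean isSpeculatorRunning() {
    return speculator.isRunning();
  }

  String getName() {
    return name;
  }
}
```

Because the vertex keeps the reference, it needs no access to the services map to shut the speculator down; the trade-off is that the application master no longer stops it automatically.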
I am almost done with implementing the code based on your suggestions. If you
think that having speculators stay alive until the DAG is completed is
acceptable, then I will go ahead and upload the patch. Otherwise, I will work on
a few changes to remove the speculator of a completed vertex.
Let me know what you think.
> Tez Speculation decision is calculated on each update by the dispatcher
> -----------------------------------------------------------------------
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Ahmed Hussein
> Assignee: Ahmed Hussein
> Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch,
> TEZ-4067.003.patch, TEZ-4067.004.patch, TEZ-4067.005.patch
>
>
> LegacySpeculator is an object field in VertexImpl. Therefore, all events are
> handled synchronously by the caller (dispatcher). This implies the following:
> # the dispatcher spends a long time executing updateStatus as it needs to
> check the runtime estimation of the tezAttempts within the vertex.
> # the speculator is per stage: launching a speculation may not be the optimal
> decision. Ideally, based on resources, speculated tasks should be the ones
> with the slowest progress.
> # the time between speculations is skewed because there is a big delay for
> the dispatcher to complete a full cycle. Also, speculation will be more
> aggressive compared to MR because MR waits for
> "soonest.retry.after.speculate" whenever a task is speculated. On the other
> hand, Tez speculates more tasks as it processes stages in parallel.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)