[
https://issues.apache.org/jira/browse/TEZ-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600198#comment-14600198
]
Bikas Saha commented on TEZ-2565:
---------------------------------
1) The current release of Tez returns stats for all tasks (running and
completed) as far as the API is concerned. Hence, changing the API semantics to
return only completed-task stats is an incompatible change and should be
recorded as such. Alternatively, create a new API that lets the caller specify
which task states are desired and returns stats for those. We can punt on the
running tasks for now, but my guess is that we will need to fix that soon with
pipelined shuffle events. Stats would then be available much earlier than the
completion time of the task, especially for skewed tasks, which are the main
optimization targets for all this work. (In fact, even without pipelining this
could be done by simply reporting stats from the outputs more often than is
currently done.) The earlier we can detect such skews, the better. Since this
JIRA is reworking this code path, ideally we should fix this broadly for the
anticipated use cases. However, I am fine if we choose to punt on it for
later.
2) On the patch itself: resetting the cache upon task failures should remove
the need to check for null all the time. I am not sure why constructStats() is
merging the stats from each task every time. This merge could happen in
TaskCompletedTransition, and getStats() could simply return constructStats().
When the cache is invalidated upon re-run, the transition could reset the
cache and re-populate it from the completed tasks within that transition. Given
the above, it is unclear why we are seeing a reduction in CPU time. Perhaps
that is due to ignoring running tasks and not due to any optimization for the
completed tasks, because those still seem to be merged every time.
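
To make the suggestion in (2) concrete, here is a rough sketch of the shape I
have in mind. It is not the Tez code: VertexStatsCacheSketch, Stats,
onTaskCompleted() and onTaskRerun() are hypothetical names standing in for
VertexImpl, the per-task stats object, and the completion and re-run
transitions. The point is only that the merge happens once per completed task,
the cache is rebuilt when a completed task is re-run, and the getter never has
to merge or check for null.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; none of these names are actual Tez classes. */
final class VertexStatsCacheSketch {

  /** Stand-in for the per-task statistics reported by a task's outputs. */
  static final class Stats {
    long dataSize;
    long itemsProcessed;

    static Stats of(long dataSize, long itemsProcessed) {
      Stats s = new Stats();
      s.dataSize = dataSize;
      s.itemsProcessed = itemsProcessed;
      return s;
    }

    /** Adds the other stats into this one. */
    void merge(Stats other) {
      dataSize += other.dataSize;
      itemsProcessed += other.itemsProcessed;
    }
  }

  // Stats of completed tasks, keyed by task index, kept so the cache can be
  // rebuilt after an invalidation.
  private final Map<Integer, Stats> completedTaskStats = new HashMap<>();

  // Merged stats of all completed tasks. Always non-null, so callers never
  // need a null check.
  private Stats cachedStats = new Stats();

  /** Would be called from the task-completed transition. */
  void onTaskCompleted(int taskIndex, Stats taskStats) {
    Stats previous = completedTaskStats.put(taskIndex, taskStats);
    if (previous == null) {
      // Common case: one incremental merge, no scan over all tasks.
      cachedStats.merge(taskStats);
    } else {
      // The task had already completed once; avoid double counting.
      rebuildCache();
    }
  }

  /** Would be called when a previously completed task is re-run. */
  void onTaskRerun(int taskIndex) {
    if (completedTaskStats.remove(taskIndex) != null) {
      rebuildCache();
    }
  }

  /** Resets the cache and re-populates it from the completed tasks. */
  private void rebuildCache() {
    Stats fresh = new Stats();
    for (Stats s : completedTaskStats.values()) {
      fresh.merge(s);
    }
    cachedStats = fresh;
  }

  /** No merging here; returns the cached object itself (not a copy). */
  Stats getStats() {
    return cachedStats;
  }
}
```

With this shape the full scan only happens on the rare re-run path, so the cost
of getStats() no longer depends on how often it is invoked. Running tasks are
deliberately left out of the sketch, in line with punting on them in (1).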
> Consider scanning unfinished tasks in VertexImpl::constructStatistics to
> reduce merge overhead
> ----------------------------------------------------------------------------------------------
>
> Key: TEZ-2565
> URL: https://issues.apache.org/jira/browse/TEZ-2565
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2565.1.patch, TEZ-2565.2.patch, TEZ-2565.3.patch,
> cpu_usage_with_patch.png, cpu_usage_without_patch.png,
> mem_usage_with_patch.png, mem_usage_without_patch.png
>
>
> constructStatistics() can be an overhead (scanning all tasks and merging
> stats) depending on the number of invocations to Vertex::getStatistics().
> Consider scanning only unfinished tasks.