[ 
https://issues.apache.org/jira/browse/TEZ-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600198#comment-14600198
 ] 

Bikas Saha commented on TEZ-2565:
---------------------------------

1) The current release of Tez returns all stats (running/completed) as far as 
the API is concerned. Hence, changing the API semantics to say that it only 
returns completed task stats is an incompatible change. It should be recorded 
as such. Alternatively, create a new API that allows specifying what states are 
desired and returning those. We can punt on the running tasks for now, but my 
guess is that we will need to fix that soon with pipelined shuffle events. 
Stats would be available much earlier than the completion time of the task, 
especially for skewed tasks, which are the main optimization targets for all 
this work. (In fact, even without pipelining this could be done by simply 
reporting stats from the outputs more often than is currently being done). The 
earlier we can detect such skews the better. Since this jira is looking at 
re-working this code path, ideally we should be looking at fixing this broadly 
for the anticipated use cases. However, I am fine if we choose to punt on it 
until later.
2) On the patch itself, resetting the cache upon task failures should remove 
the need to check for null all the time. I am not sure why constructStats() is 
merging the stats from each task every time. This merge could happen in 
TaskCompletedTransition, and getStats() could simply return constructStats(). 
When the cache is invalidated upon re-run, the transition could reset the 
cache and re-populate it from the completed tasks within that transition. 
Given the above, it's unclear why we are seeing a reduction in CPU time. 
Perhaps that is due to ignoring running tasks and not due to any optimization 
for the completed tasks, because those seem to be merged every time.
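
To make the two points above concrete, here is a rough Java sketch of what is being suggested. It is not taken from the patch or from Tez: the class VertexStatsCache, its methods, the TaskState enum and the use of a single byte count as the "stats" are all hypothetical and only serve as an illustration. Completed-task stats are merged once, when a task completes; a re-run resets the cache and re-populates it from the remaining completed tasks; and the caller says which task states it wants.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: none of these names are real Tez classes or APIs.
class VertexStatsCache {
    enum TaskState { RUNNING, COMPLETED }

    // Last reported output size (bytes) per task index. A single long stands in
    // for the real statistics object, purely as an assumption for illustration.
    private final Map<Integer, Long> completed = new HashMap<>();
    private final Map<Integer, Long> running = new HashMap<>();

    // Cached merge of all completed tasks. It starts at zero, so it is never null.
    private long completedMerged = 0L;

    // A running task reported intermediate stats (e.g. from pipelined output).
    void onRunningUpdate(int taskIndex, long bytes) {
        running.put(taskIndex, bytes);
    }

    // Analogous to doing the merge inside a task-completed transition.
    void onTaskCompleted(int taskIndex, long bytes) {
        running.remove(taskIndex);
        completed.put(taskIndex, bytes);
        completedMerged += bytes; // merge exactly once, at completion time
    }

    // A completed task is being re-run: reset the cache and re-populate it
    // from the tasks that are still completed, all within this one call.
    void onTaskRerun(int taskIndex) {
        if (completed.remove(taskIndex) == null) {
            return; // task had not completed, so the cache is still valid
        }
        completedMerged = 0L;
        for (long bytes : completed.values()) {
            completedMerged += bytes;
        }
    }

    // The caller specifies which task states should contribute to the result.
    long getStats(EnumSet<TaskState> states) {
        long total = 0L;
        if (states.contains(TaskState.COMPLETED)) {
            total += completedMerged; // no per-call scan of completed tasks
        }
        if (states.contains(TaskState.RUNNING)) {
            for (long bytes : running.values()) {
                total += bytes; // only unfinished tasks are scanned here
            }
        }
        return total;
    }
}
```

With something along these lines, a call asking only for COMPLETED is a constant-time read, and the cost of including RUNNING is proportional to the number of unfinished tasks rather than to all tasks in the vertex. The rebuild in onTaskRerun is deliberately a full reset rather than a subtraction, since a real stats merge may not be invertible.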

> Consider scanning unfinished tasks in VertexImpl::constructStatistics to 
> reduce merge overhead
> ----------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2565
>                 URL: https://issues.apache.org/jira/browse/TEZ-2565
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2565.1.patch, TEZ-2565.2.patch, TEZ-2565.3.patch, 
> cpu_usage_with_patch.png, cpu_usage_without_patch.png, 
> mem_usage_with_patch.png, mem_usage_without_patch.png
>
>
> constructStatistics() can be an overhead (scanning all tasks and merging 
> stats) depending on the number of invocations to Vertex::getStatistics().  
> Consider scanning only unfinished tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
