[ 
https://issues.apache.org/jira/browse/TEZ-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293117#comment-16293117
 ] 

Kuhu Shukla commented on TEZ-3810:
----------------------------------

I think we need to step back and open up for discussion the question of what 
we really want to measure here.

1. What is defined as idle shuffle time?
    a. Is it the time each fetcher has to wait for the input to be ready? OR
    b. Is it the time during which runningFetchers is zero and pendingHosts is 
empty as well? That is, as long as one fetcher is running, the shuffle process 
in general is not taken to be idle. This gets tricky if one of, say, x outputs 
from a given host takes a long time to finish, since pendingHosts will be 
non-empty and runningFetchers will be zero after all other fetches complete.
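
To make the two candidate definitions concrete, here is a minimal sketch. It is 
not Tez code: the class ShuffleIdleSketch and all of its methods are 
hypothetical, timestamps are passed in explicitly instead of read from a clock, 
and it assumes a fetcher counts as idle whenever it is not actively fetching. 
Only the names runningFetchers and pendingHosts come from the discussion above.

```java
/** Hypothetical sketch comparing two idle-time definitions; not the Tez implementation. */
final class ShuffleIdleSketch {
    private final int numFetchers;
    private int runningFetchers = 0;
    private int pendingHosts = 0;
    private long lastEventMs;
    private long perFetcherIdleMs = 0; // definition (a), summed over all fetchers
    private long globalIdleMs = 0;     // definition (b)

    ShuffleIdleSketch(int numFetchers, long startMs) {
        this.numFetchers = numFetchers;
        this.lastEventMs = startMs;
    }

    /** Charge the interval since the previous event to both counters. */
    private void advance(long nowMs) {
        long delta = nowMs - lastEventMs;
        // (a): every fetcher that is not running accrues idle time.
        perFetcherIdleMs += delta * (numFetchers - runningFetchers);
        // (b): idle only if nothing is running and nothing is pending.
        if (runningFetchers == 0 && pendingHosts == 0) {
            globalIdleMs += delta;
        }
        lastEventMs = nowMs;
    }

    void hostBecamePending(long nowMs) { advance(nowMs); pendingHosts++; }

    void fetchStarted(long nowMs) { advance(nowMs); pendingHosts--; runningFetchers++; }

    void fetchFinished(long nowMs) { advance(nowMs); runningFetchers--; }

    void finish(long nowMs) { advance(nowMs); }

    long perFetcherIdleMs() { return perFetcherIdleMs; }

    long globalIdleMs() { return globalIdleMs; }
}
```

As a worked case under these assumptions, take two fetchers: one fetch runs 
from 0 ms to 100 ms, the host then stays pending with nothing running until 
400 ms (the straggler output), and a second fetch runs from 400 ms to 500 ms. 
Definition (a) reports 800 ms of summed fetcher idle time, while definition (b) 
reports 0 ms because pendingHosts is never empty while runningFetchers is zero.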

There are benefits to tracking the time a single fetcher is idle: it tells us 
more about how efficiently threads are assigned to map outputs. However, it may 
inflate the value, because a fetcher thread waiting on a skewed or straggler 
output would count as idle the time during which other fetches are in progress.
I would appreciate any thoughts from the community here. Thanks a lot!

> TezCounter for idle time in shuffle phase
> -----------------------------------------
>
>                 Key: TEZ-3810
>                 URL: https://issues.apache.org/jira/browse/TEZ-3810
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Ashwin Ramesh
>         Attachments: TEZ-3810-001.patch, TEZ-3810.002.patch, 
> TEZ-3810.003.patch, TEZ-3810.004.patch
>
>
>  A task attempt counter that tracks how much time was spent waiting for 
> inputs in the shuffle phase. We can use this to quickly identify jobs that 
> are wasting a lot of time on the grid with idle reducer tasks instead of 
> shuffling/merging.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
