[
https://issues.apache.org/jira/browse/TEZ-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238761#comment-14238761
]
Rajesh Balamohan commented on TEZ-1610:
---------------------------------------
[~sseth], posting more thoughts on this.
The objective of these counters is to gain more information on how shuffle is
performing in each task.
1. In every task, find out when the fetcher completed pulling data from
different sources (i.e., as an absolute timestamp). With this, we should be
able to find out the total time taken for shuffle (events, merging, shuffle,
local fetch) for every source. We can have SHUFFLE_FINISH_TIME for tracking
this.
2. In every task, find out the exact amount of time spent pulling the data
over the wire (need not include the local disk copy optimization). This can be
slightly tricky as many fetcher threads are involved in pulling the data
(e.g., it is quite possible that only one fetcher is pulling the data and the
rest are idle). Instead, it would be good to represent it as the percentage of
time the fetchers spent pulling the data over the wire, i.e.,
SHUFFLE_TIME_AS_PERCENTAGE = (cumulative time taken for pulling the data over
the wire by all fetchers) / (TEZ_RUNTIME_SHUFFLE_PARALLEL_COPIES * (shuffle
runtime)), where "shuffle runtime" is the observed runtime in Shuffle.java
(before shutting down fetchers).
3. In every task, find out whether shuffle got delayed due to event arrivals
from the source. This can again be represented as a percentage of the overall
shuffle phase, i.e., SHUFFLE_LAST_EVENT_ARRIVAL_PERCENTAGE = (shuffle end time
- last_event_arrival_time) / (shuffle end time). If the percentage is high, it
would mean that the shuffle got delayed due to event arrival from the source.
4. If this is useful, we can follow the same percentage approach for getting
the in-memory merge timings as well, i.e., (cumulative time taken for
in-memory merge in all fetchers) / (TEZ_RUNTIME_SHUFFLE_PARALLEL_COPIES *
(shuffle runtime)), where "shuffle runtime" is the observed runtime in
Shuffle.java (before shutting down fetchers).
5. MERGED_INPUT_READY_DELTA can be an additional counter which would provide
details on the time spent in closing the merger.
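To make the percentage formulas in items 2 to 4 concrete, here is a rough
sketch of how they could be computed. This is not code from the patch:
ShuffleCounterSketch and all of its method names are hypothetical, the
parallelCopies argument stands in for the TEZ_RUNTIME_SHUFFLE_PARALLEL_COPIES
setting, and the last-event calculation assumes the times in item 3 are
measured relative to the start of shuffle.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper for illustration only; not part of Tez. */
final class ShuffleCounterSketch {
    // Cumulative time spent by all fetcher threads pulling data over the wire.
    private final AtomicLong wireFetchNanos = new AtomicLong();
    // Cumulative time spent by all fetcher threads in in-memory merges.
    private final AtomicLong inMemoryMergeNanos = new AtomicLong();
    // Assumed to mirror the configured number of parallel copies (fetchers).
    private final int parallelCopies;

    ShuffleCounterSketch(int parallelCopies) {
        if (parallelCopies <= 0) {
            throw new IllegalArgumentException("parallelCopies must be > 0");
        }
        this.parallelCopies = parallelCopies;
    }

    // Each fetcher thread would call these after finishing a unit of work.
    void addWireFetchNanos(long nanos) { wireFetchNanos.addAndGet(nanos); }
    void addInMemoryMergeNanos(long nanos) { inMemoryMergeNanos.addAndGet(nanos); }

    /** cumulative / (parallelCopies * shuffleRuntime), expressed as 0..100. */
    static double busyPercentage(long cumulativeNanos, int parallelCopies,
                                 long shuffleRuntimeNanos) {
        if (parallelCopies <= 0 || shuffleRuntimeNanos <= 0) {
            return 0.0; // nothing meaningful to report
        }
        return 100.0 * cumulativeNanos
            / ((double) parallelCopies * shuffleRuntimeNanos);
    }

    /** Item 2: share of fetcher capacity spent pulling data over the wire. */
    double shuffleTimePercentage(long shuffleRuntimeNanos) {
        return busyPercentage(wireFetchNanos.get(), parallelCopies,
                              shuffleRuntimeNanos);
    }

    /** Item 4: same approach applied to in-memory merge time. */
    double mergeTimePercentage(long shuffleRuntimeNanos) {
        return busyPercentage(inMemoryMergeNanos.get(), parallelCopies,
                              shuffleRuntimeNanos);
    }

    /**
     * Item 3: (shuffle end - last event arrival) / (shuffle end), with both
     * times taken relative to shuffle start (an assumption of this sketch).
     */
    static double lastEventArrivalPercentage(long shuffleStartMs,
                                             long lastEventArrivalMs,
                                             long shuffleEndMs) {
        long shuffleDurationMs = shuffleEndMs - shuffleStartMs;
        if (shuffleDurationMs <= 0) {
            return 0.0;
        }
        return 100.0 * (shuffleEndMs - lastEventArrivalMs) / shuffleDurationMs;
    }
}
```

Dividing by (parallel copies * shuffle runtime) normalizes against the total
time the fetcher threads had available, so a task where only one of four
fetchers was busy for the whole shuffle would report about 25 rather than 100.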
Please let me know your thoughts. Also, please ignore the naming conventions
for the counters listed above (we can come up with better names once the list
is finalized).
> additional task counters for fetchers
> -------------------------------------
>
> Key: TEZ-1610
> URL: https://issues.apache.org/jira/browse/TEZ-1610
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1610.1.patch
>
>
> - ShuffleFinishTime (per source)
> - Merge time (depending on broadcast/scatter-gather shuffle)
> This would be helpful in determining when shuffle started/ended for different
> sources in a task.