[
https://issues.apache.org/jira/browse/TEZ-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gopal V updated TEZ-1167:
-------------------------
Attachment: tez-reducer-skew.png
We need leaky bucket stats collection for some of the more expensive metrics.
Let me pick an explicit example and deal with it
The slow-start in the presence of skews tend to show the reducers being
slow-started in order of their destination id - reducer 0 started before
reducer 1 etc...The trouble happens when the skew hits reducer 493 instead of
affecting any of the earlier ones on a cluster with 199 containers.
!tez-reducer-skew.png!
The ideal slow-start order would be to start the reducer which has a known
largest output size as each mapper time between min-src and max-src time frames.
Storing this information per-mapper might be incredibly expensive, however it
might be possible to collect & aggregate leaky bit-sets of information in the
vertex manager which can sample the quality of information lightly instead of
storing all the incoming information.
Today the only information collected is an isEmpty bitset produced along with
the DataMovementEvent, while this basic information is not read in any of the
scheduling layers.
Along with making that information accessible (in aggregate form) the
information about that reducers' output size can be sent as a 3-bit
log-base-1024 bitset instead of 1 bit each, which can be OR'd out at the vertex
manager to determine which destination to slow-start - in particularly for the
example I have listed since all reducers have input.
The vertex manager can sample that stream instead of mandatorily reading
through all the messages, which is where this takes a departure from the
standard counter aggregation.
The quality of information can be preserved in such a model to schedule
reducers more intelligently with just a minor increase of metrics traffic like
that in the data movement messages.
Right now, the hacky work around such skew mistakes in scheduling by lowering
the total reducer count till all of them can slow-start (i.e lower 1099 -> 199)
- which saves 70+ secs off the query.
That is a specific case where there is an explicit decision made incorrectly by
the scheduler due to the lack of data-quality metrics based decision making.
> Statistics infrastructure and API for Tez
> -----------------------------------------
>
> Key: TEZ-1167
> URL: https://issues.apache.org/jira/browse/TEZ-1167
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: tez-reducer-skew.png
>
>
> Statistics on data (data size, distribution etc) and performance (compute
> time/CPU time etc) are important in making runtime decisions on plan
> reconfiguration. For that it would be useful to have a uniform infrastructure
> that provides this information and make it available via an API so that user
> code in vertex managers etc can make use of this data to make decisions. This
> umbrella jira tracks improvements needed in Tez to support this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)