[
https://issues.apache.org/jira/browse/TEZ-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddharth Seth updated TEZ-3209:
--------------------------------
Target Version/s: 0.9.0
> Support for fair custom data routing
> ------------------------------------
>
> Key: TEZ-3209
> URL: https://issues.apache.org/jira/browse/TEZ-3209
> Project: Apache Tez
> Issue Type: New Feature
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: TEZ-3209.patch, Tez-based demuxer for highly skewed
> category data.pdf
>
>
> This is based on offline discussion with [~gopalv], [~hitesh],
> [~jrottinghuis] and [~lohit] w.r.t. the support for efficient processing of
> highly skewed unordered partitioned mapper output. Our use case is to demux
> highly skewed unordered category data partitioned by category name. Gopal and
> Hitesh mentioned dynamically shuffled join scenario.
> One option we discussed is to leverage auto-parallelism feature with upfront
> over-partitioning. That means possible overhead to support large number
> partitions and unnecessary data movement as each reducer needs to get data
> from all mappers.
> Another alternative is to use custom {{DataMovementType}} which doesn't
> require each reducer to fetch data from all mappers. In that way, a large
> partition will be processed by several reducers, each of which will fetch
> data from a portion of mappers.
> For example, say there are 100 mappers each of which has 10 partitions (P1,
> ..., P10). Each mapper generates 100MB for its P10 and 1MB for each of its
> (P1, ... P9). The default SCATTER_GATHER routing means the reducer for P10
> has to process 10GB of input and becomes the bottleneck of the job. With the
> fair custom data routing, The P10 belonging to the first 10 mappers will be
> processed by one reducer with 1GB input data. The P10 belonging to the second
> 10 mappers will be processed by another reducer, etc.
> For further optimization, we can allocate the reducer on the same nodes as
> the mappers that it fetches data from.
> To support this, we need TEZ-3206 as well as customized data routing based on
> {{VertexManagerPlugin}} and {{EdgeManagerPluginOnDemand}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)