[
https://issues.apache.org/jira/browse/TEZ-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhiyuan Yang updated TEZ-2104:
------------------------------
Comment: was deleted
(was: bq. TEZ-3209 would be good to have as something that could be used in the
shuffle edge or cross edge. Wondering if its related to the cross edge idea of
having filters that prune unwanted partitions, in the sense of removing or
merging partitions - ie logical partition management, seems to be a unifying
idea between them.
It's unrelated to the filter things. TEZ-3209 is mostly about using more than
one task handle large partition in skewed dataset. Filter is about handling the
combination of partition. However, the problem that TEZ-3209 wants to solve
also exists in cross-product edge.
bq. On that note, if we figure out that some partitions are not needed then
will we not create any tasks for them? I.e. this information is calculated up
front before determining tasks? Is this available statically at compile time
(provided to VM) or needed runtime information (calculated in VM)?
Whether a partition will be generated is decided by source vertices which we
don't intrude, and whether a partition is decided by user-provided filter. That
means it's totally decided by user.)
> A CrossProductEdge which produces synthetic cross-product parallelism
> ---------------------------------------------------------------------
>
> Key: TEZ-2104
> URL: https://issues.apache.org/jira/browse/TEZ-2104
> Project: Apache Tez
> Issue Type: New Feature
> Reporter: Gopal V
> Assignee: Zhiyuan Yang
> Labels: gsoc, gsoc2015, hadoop, hive, java, tez
> Attachments: Cartesian product edge design.2.pdf, Cross product edge
> design.pdf
>
>
> Instead of producing duplicate data for the synthetic cross-product, to fit
> into partitions, the amount of net IO can be vastly reduced by a special
> purpose cross-product data movement edge.
> The Shuffle edge routes each partition's output to a single reducer, while
> the cross-product edge routes it into a matrix of reducers without actually
> duplicating the disk data.
> A partitioning scheme with 3 partitions on the lhs and rhs of a join
> operation can be routed into 9 reducers by performing a cross-product similar
> to
> (1,2,3) x (a,b,c) = [(1,a), (1,b), (1,c), (2,a), (2,b) ...]
> This turns a single task cross-product model into a distributed cross product.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)