[ 
https://issues.apache.org/jira/browse/TEZ-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiyuan Yang updated TEZ-2104:
------------------------------
    Comment: was deleted

(was: bq. TEZ-3209 would be good to have as something that could be used in the 
shuffle edge or cross edge. Wondering if its related to the cross edge idea of 
having filters that prune unwanted partitions, in the sense of removing or 
merging partitions - ie logical partition management, seems to be a unifying 
idea between them.
It's unrelated to the filter things. TEZ-3209 is mostly about using more than 
one task handle large partition in skewed dataset. Filter is about handling the 
combination of partition. However, the problem that TEZ-3209 wants to solve 
also exists in cross-product edge.

bq. On that note, if we figure out that some partitions are not needed then 
will we not create any tasks for them? I.e. this information is calculated up 
front before determining tasks? Is this available statically at compile time 
(provided to VM) or needed runtime information (calculated in VM)?
Whether a partition will be generated is decided by source vertices which we 
don't intrude, and whether a partition is decided by user-provided filter. That 
means it's totally decided by user.)

> A CrossProductEdge which produces synthetic cross-product parallelism
> ---------------------------------------------------------------------
>
>                 Key: TEZ-2104
>                 URL: https://issues.apache.org/jira/browse/TEZ-2104
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Gopal V
>            Assignee: Zhiyuan Yang
>              Labels: gsoc, gsoc2015, hadoop, hive, java, tez
>         Attachments: Cartesian product edge design.2.pdf, Cross product edge 
> design.pdf
>
>
> Instead of producing duplicate data for the synthetic cross-product, to fit 
> into partitions, the amount of net IO can be vastly reduced by a special 
> purpose cross-product data movement edge.
> The Shuffle edge routes each partition's output to a single reducer, while 
> the cross-product edge routes it into a matrix of reducers without actually 
> duplicating the disk data.
> A partitioning scheme with 3 partitions on the lhs and rhs of a join 
> operation can be routed into 9 reducers by performing a cross-product similar 
> to 
> (1,2,3) x (a,b,c) = [(1,a), (1,b), (1,c), (2,a), (2,b) ...]
> This turns a single task cross-product model into a distributed cross product.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to