[ 
https://issues.apache.org/jira/browse/TEZ-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649164#comment-15649164
 ] 

Zhiyuan Yang commented on TEZ-2104:
-----------------------------------

At the time of my investigation I didn't find a solid use case for joining 
partitioned data with unpartitioned data, although it's possible I'm just not 
familiar enough with Hive/Pig. I'm not sure about Pig, but the Hive integration 
(HIVE-14731) is still ongoing and is taking more time because of the 
planner-side changes. Even in Hive, the integration is limited to the cartesian 
product of unpartitioned data. The partitioned use case is far more complex to 
integrate.

> A CrossProductEdge which produces synthetic cross-product parallelism
> ---------------------------------------------------------------------
>
>                 Key: TEZ-2104
>                 URL: https://issues.apache.org/jira/browse/TEZ-2104
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Gopal V
>            Assignee: Zhiyuan Yang
>              Labels: gsoc, gsoc2015, hadoop, hive, java, tez
>         Attachments: Cartesian product edge design.2.pdf, Cross product edge 
> design.pdf
>
>
> Instead of duplicating data to fit the synthetic cross-product into 
> partitions, a special-purpose cross-product data movement edge can vastly 
> reduce the amount of net IO.
> The Shuffle edge routes each partition's output to a single reducer, while 
> the cross-product edge routes it into a matrix of reducers without actually 
> duplicating the disk data.
> A partitioning scheme with 3 partitions on the lhs and rhs of a join 
> operation can be routed into 9 reducers by performing a cross-product similar 
> to 
> (1,2,3) x (a,b,c) = [(1,a), (1,b), (1,c), (2,a), (2,b) ...]
> This turns a single-task cross-product model into a distributed cross-product.
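
To illustrate the routing described in the issue, here is a minimal sketch of 
how 3 x 3 partitions could map onto 9 reducers. It is not Tez code: the 
function names and the row-major reducer numbering are assumptions made for 
this example only, and the actual edge may lay reducers out differently.

```python
from itertools import product


def cross_product_routing(lhs_partitions, rhs_partitions):
    """Map each reducer index to the (lhs, rhs) partition pair it reads.

    Hypothetical helper: reducers are numbered in row-major order,
    which is an assumption of this sketch, not something Tez guarantees.
    """
    return dict(enumerate(product(lhs_partitions, rhs_partitions)))


def reducers_for_lhs(lhs_index, num_rhs):
    """Reducer indices that read the lhs partition at position lhs_index.

    Each lhs partition is written once and read by num_rhs reducers,
    so nothing is duplicated on disk.
    """
    return [lhs_index * num_rhs + j for j in range(num_rhs)]


routing = cross_product_routing([1, 2, 3], ["a", "b", "c"])
for reducer, pair in routing.items():
    print(reducer, pair)
```

Under these assumptions each source partition is read by a whole row or 
column of the reducer matrix, which is where the IO saving over duplicating 
the data comes from.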



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
