[ 
https://issues.apache.org/jira/browse/TAJO-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560805#comment-14560805
 ] 

ASF GitHub Bot commented on TAJO-1553:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tajo/pull/583


> Improve broadcast join planning
> -------------------------------
>
>                 Key: TAJO-1553
>                 URL: https://issues.apache.org/jira/browse/TAJO-1553
>             Project: Tajo
>          Issue Type: Improvement
>          Components: distributed query plan, planner/optimizer
>            Reporter: Jihoon Son
>            Assignee: Jihoon Son
>             Fix For: 0.11.0
>
>
> The global engine generates a logical plan, and then marks some parts of the 
> plan as broadcast plan which means that they and their input will be 
> broadcasted to all workers. 
> Currently, broadcast parts are identified according to some rigid and 
> hard-coded rules. This will limit the broadcast opportunities in many cases.
> So, in this issue, I propose refactoring the broadcast planner to be more 
> general.
> Broadcast parts can be identified recursively.
> * A leaf node will be broadcasted if its input size does not exceed the 
> pre-defined threshold.
> * An intermediate node will be broadcasted if it has at least one broadcast 
> child.
> * For outer joins, row-preserved tables must not be broadcasted to avoid 
> input data duplication.
> * Given an EB containing a join and its child EBs, those EBs can be merged 
> into a single EB if at least one child EB's output is broadcastable.
> * Given a user-defined threshold, the total size of broadcast relations of an 
> EB cannot exceed such threshold.
> ** After merging EBs according to the first rule, the result EB may not 
> satisfy the second rule. In this case, enforce repartition join for large 
> relations to satisfy the second rule.
> * Preserved-row relations cannot be broadcasted to avoid duplicated results. 
> That is, full outer join cannot be executed with broadcast join.
> ** Here is brief backgrounds for this rule. Data of preserved-row relations 
> will be appeared in the join result regardless of join conditions. If 
> multiple tasks execute outer join with broadcasted preserved-row relations, 
> they emit duplicates results.
> ** Even though a single task can execute outer join when every input is 
> broadcastable, broadcast join is not allowed if one of input relation 
> consists of multiple files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to