[ 
https://issues.apache.org/jira/browse/TAJO-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihoon Son updated TAJO-1553:
-----------------------------
    Description: 
The global engine generates a logical plan, and then marks some parts of the 
plan as broadcast plan which means that they and their input will be 
broadcasted to all workers. 

Currently, broadcast parts are identified according to some rigid and 
hard-coded rules. This will limit the broadcast opportunities in many cases.
So, in this issue, I propose refactoring the broadcast planner to be more 
general.

Broadcast parts can be identified recursively.
* A leaf node will be broadcasted if its input size does not exceed the 
pre-defined threshold.
* An intermediate node will be broadcasted if it has at least one broadcast 
child.
* For outer joins, row-preserved tables must not be broadcasted to avoid input 
data duplication.

* Given an EB containing a join and its child EBs, those EBs can be merged into 
a single EB if at least one child EB's output is broadcastable.
* Given a user-defined threshold, the total size of broadcast relations of an 
EB cannot exceed such threshold.
** After merging EBs according to the first rule, the result EB may not satisfy 
the second rule. In this case, enforce repartition join for large relations to 
satisfy the second rule.
* Preserved-row relations cannot be broadcasted to avoid duplicated results. 
That is, full outer join cannot be executed with broadcast join.
** Here is brief backgrounds for this rule. Data of preserved-row relations 
will be appeared in the join result regardless of join conditions. If multiple 
tasks execute outer join with broadcasted preserved-row relations, they emit 
duplicates results.
** Even though a single task can execute outer join when every input is 
broadcastable, broadcast join is not allowed if one of input relation consists 
of multiple files.

  was:
The global engine generates a logical plan, and then marks some parts of the 
plan as broadcast plan which means that they and their input will be 
broadcasted to all workers. 

Currently, broadcast parts are identified according to some rigid and 
hard-coded rules. This will limit the broadcast opportunities in many cases.
So, in this issue, I propose refactoring the broadcast planner to be more 
general.

Broadcast parts can be identified recursively.
* A leaf node will be broadcasted if its input size does not exceed the 
pre-defined threshold.
* An intermediate node will be broadcasted if it has at least one broadcast 
child.
* For outer joins, row-preserved tables must not be broadcasted to avoid input 
data duplication.


> Improve broadcast join planning
> -------------------------------
>
>                 Key: TAJO-1553
>                 URL: https://issues.apache.org/jira/browse/TAJO-1553
>             Project: Tajo
>          Issue Type: Improvement
>          Components: distributed query plan, planner/optimizer
>            Reporter: Jihoon Son
>            Assignee: Jihoon Son
>             Fix For: 0.11.0
>
>
> The global engine generates a logical plan, and then marks some parts of the 
> plan as broadcast plan which means that they and their input will be 
> broadcasted to all workers. 
> Currently, broadcast parts are identified according to some rigid and 
> hard-coded rules. This will limit the broadcast opportunities in many cases.
> So, in this issue, I propose refactoring the broadcast planner to be more 
> general.
> Broadcast parts can be identified recursively.
> * A leaf node will be broadcasted if its input size does not exceed the 
> pre-defined threshold.
> * An intermediate node will be broadcasted if it has at least one broadcast 
> child.
> * For outer joins, row-preserved tables must not be broadcasted to avoid 
> input data duplication.
> * Given an EB containing a join and its child EBs, those EBs can be merged 
> into a single EB if at least one child EB's output is broadcastable.
> * Given a user-defined threshold, the total size of broadcast relations of an 
> EB cannot exceed such threshold.
> ** After merging EBs according to the first rule, the result EB may not 
> satisfy the second rule. In this case, enforce repartition join for large 
> relations to satisfy the second rule.
> * Preserved-row relations cannot be broadcasted to avoid duplicated results. 
> That is, full outer join cannot be executed with broadcast join.
> ** Here is brief backgrounds for this rule. Data of preserved-row relations 
> will be appeared in the join result regardless of join conditions. If 
> multiple tasks execute outer join with broadcasted preserved-row relations, 
> they emit duplicates results.
> ** Even though a single task can execute outer join when every input is 
> broadcastable, broadcast join is not allowed if one of input relation 
> consists of multiple files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to