[ 
https://issues.apache.org/jira/browse/TAJO-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709134#comment-14709134
 ] 

ASF GitHub Bot commented on TAJO-1766:
--------------------------------------

Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/706#issuecomment-134164003
  
    This patch is ready for review. Sorry for the large patch.
    
    Here are some highlights of the changes.
    * Added a session variable that limits the size of a broadcast table for 
cross join. This value takes effect only when ```TEST_BROADCAST_JOIN_ENABLED``` 
is set.
    * Cross join is always executed as a broadcast join. For this to work, at 
least one input of the cross join must be a relation smaller than 
```BROADCAST_CROSS_JOIN_THRESHOLD```.
    * Added ```PostLogicalPlanVerifier``` to verify whether a cross join is 
executable.
    * Fixed some bugs in BroadcastJoinRule.
    * Fixed some bugs in QueryTestCaseBase.
    * Removed the block nested loop (BNL) and nested loop (NL) join executors. 
Instead, each task executes cross join with hash join. This works because one of 
the inputs of a cross join is always cached in the broadcast cache holder.
    * Improved unique key generation for a scan executor in 
```TaskAttemptContext```.
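
    The cross join rules above can be sketched roughly as follows. This is only 
an illustration under assumptions, not the code in this patch: the class 
```BroadcastCrossJoinSketch```, its method names, and the 64 MB threshold are all 
hypothetical.

    ```java
    import java.util.List;

    // Hypothetical sketch; none of these names come from Tajo.
    class BroadcastCrossJoinSketch {

      // A cross join is treated as executable only if at least one input
      // is smaller than the broadcast threshold (all sizes in bytes).
      static boolean isExecutable(long leftBytes, long rightBytes, long thresholdBytes) {
        return leftBytes < thresholdBytes || rightBytes < thresholdBytes;
      }

      // Each task pairs its own fragment of the large table with the whole
      // broadcast (small) table, which is assumed to be held in memory.
      static long countPairsForTask(List<String> largeFragment, List<String> broadcastTable) {
        long pairs = 0;
        for (String largeRow : largeFragment) {
          // In a cross join, every broadcast row matches every large-table row.
          pairs += broadcastTable.size();
        }
        return pairs;
      }

      public static void main(String[] args) {
        long threshold = 64L * 1024 * 1024; // assumed value, not a Tajo default

        // Example sizes of roughly 12.2 GB and 14.1 MB.
        System.out.println(isExecutable(12_200_000_000L, 14_100_000L, threshold));

        // Two tasks, each holding one fragment of the large table.
        List<String> broadcast = List.of("s1", "s2");
        long total = countPairsForTask(List.of("p1", "p2", "p3"), broadcast)
            + countPairsForTask(List.of("p4", "p5"), broadcast);
        System.out.println(total);
      }
    }
    ```

    Because every task sees the complete small table, the per-task results can 
simply be summed, and no task needs rows from another task's fragment.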
    
    I've tested cross join performance on a cluster consisting of a master 
and 5 workers. Each worker has 48 cores, 80 GB of memory, and 24 disks. 
    
    #### Data
    * partsupp: 80,000,000 rows (12.2 GB as a TEXT file)
    * supplier_small: 100,000 rows (14.1 MB as a TEXT file)
    
    #### Query
    ```
    select count(*) from partsupp, supplier_small
    ```
    
    #### Result
    With this patch, the query takes 1 hr, 14 min, 33 sec. Without this patch, 
the same query runs forever because a single worker executes the whole cross join.
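
    To put this result in perspective, the number of row pairs the query has to 
count follows directly from the row counts above. A small sketch of the 
arithmetic (the class and method names are hypothetical, and it assumes the 
whole elapsed time is spent on the join):

    ```java
    // Hypothetical helper for back-of-the-envelope arithmetic only.
    class CrossJoinSizeEstimate {

      // A cross join emits one row per (left row, right row) pair.
      static long outputRows(long leftRows, long rightRows) {
        return Math.multiplyExact(leftRows, rightRows); // throws on overflow
      }

      static long toSeconds(long hours, long minutes, long seconds) {
        return hours * 3600 + minutes * 60 + seconds;
      }

      public static void main(String[] args) {
        long pairs = outputRows(80_000_000L, 100_000L); // partsupp x supplier_small
        long seconds = toSeconds(1, 14, 33);            // elapsed time with the patch
        System.out.printf("pairs=%d, seconds=%d, pairs/sec=%d%n",
            pairs, seconds, pairs / seconds);
      }
    }
    ```

    That is 8 x 10^12 row pairs in 4473 seconds, or roughly 1.8 billion pairs 
per second across the whole cluster.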



> Improve the performance of cross join
> -------------------------------------
>
>                 Key: TAJO-1766
>                 URL: https://issues.apache.org/jira/browse/TAJO-1766
>             Project: Tajo
>          Issue Type: Improvement
>          Components: distributed query plan
>            Reporter: Jihoon Son
>            Assignee: Jihoon Son
>             Fix For: 0.11.0
>
>
> Cross join is a very heavy operation. Furthermore, in the current 
> implementation this operator is performed by a single worker. (Please see the 
> implementation of HashPartitioner: if partitionKeyIds is empty, 
> getPartition() always returns a single value.)
> One possible alternative is to execute cross join as a broadcast join. That 
> is, the outer table (the smaller one) is always broadcast, and the join is 
> performed by the machines that store parts of the inner table.
> To do so, a new session variable is required to set the broadcast threshold 
> for cross join. 
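
The single-worker behavior described in the issue can be illustrated with a toy
hash partitioner. This is a sketch under assumptions, not Tajo's actual
HashPartitioner: the class name, the signature, and the use of
Arrays.hashCode are all hypothetical. It only shows why an empty key list
sends every row to the same partition.

```java
import java.util.Arrays;

// Hypothetical stand-in for a hash partitioner; not Tajo code.
class EmptyKeyPartitionSketch {

  // Picks a partition by hashing the values of the key columns.
  static int getPartition(Object[] row, int[] partitionKeyIds, int numPartitions) {
    Object[] keys = new Object[partitionKeyIds.length];
    for (int i = 0; i < partitionKeyIds.length; i++) {
      keys[i] = row[partitionKeyIds[i]];
    }
    // With no key columns, keys is empty and its hash is the same for every row.
    int hash = Arrays.hashCode(keys);
    return Math.floorMod(hash, numPartitions);
  }

  public static void main(String[] args) {
    int[] noKeys = new int[0]; // a cross join has no join keys
    int first = getPartition(new Object[] {"a", 1}, noKeys, 5);
    int second = getPartition(new Object[] {"b", 2}, noKeys, 5);
    System.out.println(first == second);
  }
}
```

Since a cross join has no join keys to hash on, a partitioner of this kind
cannot spread the work, which is why broadcasting the smaller table is proposed
instead.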



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
