[ 
https://issues.apache.org/jira/browse/TAJO-680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273241#comment-14273241
 ] 

Jihoon Son commented on TAJO-680:
---------------------------------

There are many query optimization issues around the _IN subquery_ statement.
However, it would be better to start with a straightforward implementation and 
then improve on it incrementally.

The straightforward way is two-phase execution. Here is the overall idea.
* In the first phase, the subquery (in the example from the issue description, 
select r_regionkey from region) is executed. At the end of the first phase, 
each worker writes its intermediate results to its own local disks. 
* These intermediate results must then be sent to the workers of the second 
phase. There are two options. If the intermediate results are sufficiently 
small, the whole result set is broadcast to all workers of the second phase. 
Otherwise, both the intermediate results and the input relation of the second 
phase must be shuffled according to the values of the key (regionkey). 
* Finally, the remaining part (select * from nation where n_regionkey in 
(results of the first phase)) is executed in the second phase. 
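
To make the two phases concrete, here is a minimal single-process sketch in 
Python. It is only an illustration, not Tajo code: the function names, the 
in-memory lists standing in for worker partitions and local disks, and the 
BROADCAST_THRESHOLD value are all assumptions (the actual criterion for 
"sufficiently small" is not decided here).

```python
# Hypothetical cutoff (in rows) below which the subquery result is broadcast.
BROADCAST_THRESHOLD = 3


def phase_one(region_partitions):
    """Each worker evaluates the subquery on its own partition.

    The returned per-worker lists stand in for the intermediate results
    that each worker would write to its local disks.
    """
    return [[row["r_regionkey"] for row in part] for part in region_partitions]


def choose_strategy(intermediate, threshold=BROADCAST_THRESHOLD):
    """Pick "broadcast" for small subquery results, otherwise "shuffle"."""
    total = sum(len(part) for part in intermediate)
    return "broadcast" if total <= threshold else "shuffle"


def shuffle(items, key, num_workers):
    """Hash-partition items so that equal keys land on the same worker."""
    buckets = [[] for _ in range(num_workers)]
    for item in items:
        buckets[hash(key(item)) % num_workers].append(item)
    return buckets


def phase_two(nation_partitions, intermediate, strategy):
    """Evaluate `n_regionkey IN (phase-one results)` on every worker."""
    num_workers = len(nation_partitions)
    if strategy == "broadcast":
        # Every worker receives the complete key set and keeps its own input.
        keys = {k for part in intermediate for k in part}
        per_worker = [(part, keys) for part in nation_partitions]
    else:
        # Both sides are repartitioned on the key (regionkey).
        all_rows = [row for part in nation_partitions for row in part]
        all_keys = [k for part in intermediate for k in part]
        row_buckets = shuffle(all_rows, lambda r: r["n_regionkey"], num_workers)
        key_buckets = shuffle(all_keys, lambda k: k, num_workers)
        per_worker = [
            (row_buckets[i], set(key_buckets[i])) for i in range(num_workers)
        ]
    return [
        row for rows, keys in per_worker for row in rows
        if row["n_regionkey"] in keys
    ]
```

Both strategies should return the same rows; they differ only in how much data 
is moved between the two phases.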

The most challenging part of query planning is how to represent the result set 
of a subquery as an EvalNode.
To address this, I would like to introduce a new EvalNode, called 
RowConstantFromDiskEval, which reads a list of values stored on disk. 
During query execution, RowConstantFromDiskEval reads the shuffled intermediate 
results, so that InEval can be evaluated.
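
As a rough sketch of the proposed node, the Python below models the idea only. 
It does not follow Tajo's actual Java EvalNode interface: the eval() methods, 
the constructor arguments, the one-integer-per-line file format, and the 
write_intermediate helper are all assumptions made for illustration.

```python
import os
import tempfile


class RowConstantFromDiskEval:
    """Hypothetical node: a constant value list loaded lazily from disk."""

    def __init__(self, paths):
        self.paths = list(paths)
        self._values = None  # loaded on first evaluation, then cached

    def eval(self):
        if self._values is None:
            values = set()
            for path in self.paths:
                with open(path) as f:
                    # Assumed format: one integer key per line.
                    values.update(int(line) for line in f if line.strip())
            self._values = frozenset(values)
        return self._values


class InEval:
    """Simplified IN predicate: column value IN (values of the right child)."""

    def __init__(self, column, value_set_eval):
        self.column = column
        self.value_set_eval = value_set_eval

    def eval(self, row):
        return row[self.column] in self.value_set_eval.eval()


def write_intermediate(values, directory):
    """Hypothetical helper standing in for a first-phase worker's output."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".txt", text=True)
    with os.fdopen(fd, "w") as f:
        for value in values:
            f.write(f"{value}\n")
    return path
```

With this shape, the planner could build 
InEval("n_regionkey", RowConstantFromDiskEval(paths)) before the first phase 
has run, because the value list is read only when the predicate is first 
evaluated.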

> Improve the IN operator to support sub queries
> ----------------------------------------------
>
>                 Key: TAJO-680
>                 URL: https://issues.apache.org/jira/browse/TAJO-680
>             Project: Tajo
>          Issue Type: Improvement
>          Components: distributed query plan, parser
>            Reporter: Jihoon Son
>            Assignee: Jihoon Son
>             Fix For: 0.11
>
>
> Currently, the IN operator can be used only with sets of values.
> We need to improve it to support subqueries, as in the following example query.
> {noformat}
> tajo> select * from nation where n_regionkey in (select r_regionkey from 
> region); 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
