[ 
https://issues.apache.org/jira/browse/HIVE-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085826#comment-15085826
 ] 

Aihua Xu commented on HIVE-12762:
---------------------------------

We have two issues: 1. We are filtering the parquet columns based on the last 
filter condition in the query. So if the query contains multiple instances of 
the same table, e.g., join on the same table with different filter conditions, 
then we could get incorrect result; 2. rewriteLeaves implementation in 
SearchArgumentImpl is not accurate since the different leaves could be sharing 
the same object. The current implementation could change the leave index 
multiple times to an incorrect value.

The patch will merge all the filter conditions (create OR expression on all the 
filters) so that the columns which will be used during operator won't be 
filtered during earlier splitting stage. rewriteLeaves is reimplemented to get 
all the unique leaves first and replace in place.

[~xuefuz] [~spena] Can you help review the code?

> Common join on parquet tables returns incorrect result when 
> hive.optimize.index.filter set to true
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-12762
>                 URL: https://issues.apache.org/jira/browse/HIVE-12762
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 2.1.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>         Attachments: HIVE-12762.2.patch, HIVE-12762.patch
>
>
> The following query will give incorrect result.
> {noformat}
> CREATE TABLE tbl1(id INT) STORED AS PARQUET;
> INSERT INTO tbl1 VALUES(1), (2);
> CREATE TABLE tbl2(id INT, value STRING) STORED AS PARQUET;
> INSERT INTO tbl2 VALUES(1, 'value1');
> INSERT INTO tbl2 VALUES(1, 'value2');
> set hive.optimize.index.filter = true;
> set hive.auto.convert.join=false;
> select tbl1.id, t1.value, t2.value
> FROM tbl1
> JOIN (SELECT * FROM tbl2 WHERE value='value1') t1 ON tbl1.id=t1.id
> JOIN (SELECT * FROM tbl2 WHERE value='value2') t2 ON tbl1.id=t2.id;
> {noformat}
> We are enforcing to use common join and tbl2 will have 2 files after 2 
> insertions underneath.
> the map job contains 3 TableScan operators (2 for tbl2 and 1 for tbl1). When  
>   hive.optimize.index.filter is set to true, we are incorrectly applying the 
> later filtering condition to each block, which causes no data is returned for 
> the subquery {{SELECT * FROM tbl2 WHERE value='value1'}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to