[ 
https://issues.apache.org/jira/browse/HIVE-28213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842956#comment-17842956
 ] 

Krisztian Kasa commented on HIVE-28213:
---------------------------------------

IMHO {{hive.tez.bucket.pruning}} shouldn't be allowed while scanning external 
tables. It relies on the filenames:
[https://github.com/apache/hive/blob/636b0d3abf00afbe2cf71dc89f762acca48867ca/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L339]
[https://github.com/apache/hive/blob/636b0d3abf00afbe2cf71dc89f762acca48867ca/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2077]
Datafiles belongs to external tables are allowed to changed without Hive hence 
this optimization can lead to data correctness issues.
 

> Incorrect results after insert-select from similar bucketed source & target 
> table
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-28213
>                 URL: https://issues.apache.org/jira/browse/HIVE-28213
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Naresh P R
>            Assignee: Zoltán Rátkai
>            Priority: Major
>         Attachments: test.q
>
>
> Insert-select is not honoring bucketing if both source & target are bucketed 
> on same column.
> eg., 
> {code:java}
> CREATE EXTERNAL TABLE bucketing_table1 (id INT)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 32 BUCKETS stored as textfile;
> INSERT INTO TABLE bucketing_table1 VALUES (1), (2), (3), (4), (5);
> CREATE EXTERNAL TABLE bucketing_table2 like bucketing_table1;
> INSERT INTO TABLE bucketing_table2 select * from bucketing_table1;{code}
> id=1 => murmur_hash(1) %32 should go to 29th bucket file.
> bucketing_table1 has id=1 at 29th file,
> but bucketing_table2 doesn't have 29th file because Insert-select dint honor 
> the bucketing.
> {code:java}
> SELECT count(*) FROM bucketing_table1 WHERE id = 1;
> ===
> 1 //correct result
> SELECT count(*) FROM bucketing_table2 WHERE id = 1;   
> === 
> 0 // incorrect result
> select *, INPUT__FILE__NAME from bucketing_table1;
> +----------------------+----------------------------------------------------+
> | bucketing_table1.id  |                 input__file__name                  |
> +----------------------+----------------------------------------------------+
> | 2                    | <warehouse>/bucketing_table1/000004_0 |
> | 3                    | <warehouse>/bucketing_table1/000006_0 |
> | 5                    | <warehouse>/bucketing_table1/000015_0 |
> | 4                    | <warehouse>/bucketing_table1/000021_0 |
> | 1                    | <warehouse>/bucketing_table1/000029_0 |
> +----------------------+----------------------------------------------------+
> select *, INPUT__FILE__NAME from bucketing_table2;
> +-------------+----------------------------------------------------+
> | bucketing_table2.id  |                 input__file__name                  |
> +-------------+----------------------------------------------------+
> | 2           | <warehouse>/bucketing_table2/000000_0 |
> | 3           | <warehouse>/bucketing_table2/000001_0 |
> | 5           | <warehouse>/bucketing_table2/000002_0 |
> | 4           | <warehouse>/bucketing_table2/000003_0 |
> | 1           | <warehouse>/bucketing_table2/000004_0 |
> +----------------------+----------------------------------------------------+{code}
> Workaround for read: hive.tez.bucket.pruning=false;
> PS: Attaching repro file [^test.q]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to