[
https://issues.apache.org/jira/browse/HIVE-28213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842956#comment-17842956
]
Krisztian Kasa commented on HIVE-28213:
---------------------------------------
IMHO {{hive.tez.bucket.pruning}} shouldn't be allowed while scanning external
tables. It relies on the filenames:
[https://github.com/apache/hive/blob/636b0d3abf00afbe2cf71dc89f762acca48867ca/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L339]
[https://github.com/apache/hive/blob/636b0d3abf00afbe2cf71dc89f762acca48867ca/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2077]
Datafiles belongs to external tables are allowed to changed without Hive hence
this optimization can lead to data correctness issues.
> Incorrect results after insert-select from similar bucketed source & target
> table
> ---------------------------------------------------------------------------------
>
> Key: HIVE-28213
> URL: https://issues.apache.org/jira/browse/HIVE-28213
> Project: Hive
> Issue Type: Bug
> Reporter: Naresh P R
> Assignee: Zoltán Rátkai
> Priority: Major
> Attachments: test.q
>
>
> Insert-select is not honoring bucketing if both source & target are bucketed
> on same column.
> eg.,
> {code:java}
> CREATE EXTERNAL TABLE bucketing_table1 (id INT)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 32 BUCKETS stored as textfile;
> INSERT INTO TABLE bucketing_table1 VALUES (1), (2), (3), (4), (5);
> CREATE EXTERNAL TABLE bucketing_table2 like bucketing_table1;
> INSERT INTO TABLE bucketing_table2 select * from bucketing_table1;{code}
> id=1 => murmur_hash(1) %32 should go to 29th bucket file.
> bucketing_table1 has id=1 at 29th file,
> but bucketing_table2 doesn't have 29th file because Insert-select dint honor
> the bucketing.
> {code:java}
> SELECT count(*) FROM bucketing_table1 WHERE id = 1;
> ===
> 1 //correct result
> SELECT count(*) FROM bucketing_table2 WHERE id = 1;
> ===
> 0 // incorrect result
> select *, INPUT__FILE__NAME from bucketing_table1;
> +----------------------+----------------------------------------------------+
> | bucketing_table1.id | input__file__name |
> +----------------------+----------------------------------------------------+
> | 2 | <warehouse>/bucketing_table1/000004_0 |
> | 3 | <warehouse>/bucketing_table1/000006_0 |
> | 5 | <warehouse>/bucketing_table1/000015_0 |
> | 4 | <warehouse>/bucketing_table1/000021_0 |
> | 1 | <warehouse>/bucketing_table1/000029_0 |
> +----------------------+----------------------------------------------------+
> select *, INPUT__FILE__NAME from bucketing_table2;
> +-------------+----------------------------------------------------+
> | bucketing_table2.id | input__file__name |
> +-------------+----------------------------------------------------+
> | 2 | <warehouse>/bucketing_table2/000000_0 |
> | 3 | <warehouse>/bucketing_table2/000001_0 |
> | 5 | <warehouse>/bucketing_table2/000002_0 |
> | 4 | <warehouse>/bucketing_table2/000003_0 |
> | 1 | <warehouse>/bucketing_table2/000004_0 |
> +----------------------+----------------------------------------------------+{code}
> Workaround for read: hive.tez.bucket.pruning=false;
> PS: Attaching repro file [^test.q]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)