[https://issues.apache.org/jira/browse/HIVE-20183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547471#comment-16547471]
Peter Vary commented on HIVE-20183:
-----------------------------------
[~kgyrtkirk]: Could you please review?
The thoughts behind the patch are the following:
* FileSinkOperator takes care of bucketing here:
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L692-L705]
* Essentially, it derives the output bucket from the source *fileId*
instead of the *taskId*
* The fileId is set here:
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java#L154-L167]
* This is only called when the first row is read from the table, so if a
file contains no data it is never called
* The patch changes this by also calling it in *closeOp* if the fileId is
not set yet.
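The idea behind the fix can be sketched as follows (a minimal illustration only; the class and method names are hypothetical, not Hive's actual APIs):

```java
// Hypothetical sketch of the patch's idea: the scan normally derives the
// bucket fileId from the input file when the first row is read; for an
// empty bucket file that never happens, so close() must fall back to
// initializing it, ensuring the sink still emits the matching empty bucket.
class ScanSketch {
    private Integer fileId;        // bucket number parsed from the input file name
    private final String inputFile;

    ScanSketch(String inputFile) {
        this.inputFile = inputFile;
    }

    void processRow(Object row) {
        if (fileId == null) {
            fileId = parseBucketId(inputFile);  // set on the first row only
        }
        // ... forward the row to downstream operators ...
    }

    void close() {
        // The fix: if no row was ever read (empty bucket file), fileId is
        // still unset; initialize it here so downstream bucketing stays correct.
        if (fileId == null) {
            fileId = parseBucketId(inputFile);
        }
    }

    // Extract the bucket number from a file name like "000002_0".
    static int parseBucketId(String name) {
        return Integer.parseInt(name.substring(0, name.indexOf('_')));
    }
}
```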
The tests are all green. Added new tests for TestCliDriver and TestSparkCliDriver:
* [https://builds.apache.org/job/PreCommit-HIVE-Build/12661/testReport/org.apache.hadoop.hive.cli/TestCliDriver/testCliDriver_bucket7_/]
* [https://builds.apache.org/job/PreCommit-HIVE-Build/12661/testReport/org.apache.hadoop.hive.cli/TestSparkCliDriver/testCliDriver_bucket7_/]
I think Tez does not have to be tested, since it handles 0-length files
differently.
Thanks,
Peter
> Inserting from bucketed table can cause data loss, if the source table
> contains empty buckets
> ---------------------------------------------------------------------------------------------
>
> Key: HIVE-20183
> URL: https://issues.apache.org/jira/browse/HIVE-20183
> Project: Hive
> Issue Type: Bug
> Components: Operators
> Reporter: Peter Vary
> Assignee: Peter Vary
> Priority: Major
> Attachments: HIVE-20183.2.patch, HIVE-20183.patch
>
>
> The issue can be reproduced by the following:
> {code}
> set hive.enforce.bucketing=true;
> set hive.enforce.sorting=true;
> set hive.optimize.bucketingsorting=true;
> create table bucket1 (id int, val string) clustered by (id) sorted by (id ASC) INTO 4 BUCKETS;
> insert into bucket1 values (1, 'abc'), (3, 'abc');
> select * from bucket1;
> +-------------+--------------+
> | bucket1.id | bucket1.val |
> +-------------+--------------+
> | 3 | abc |
> | 1 | abc |
> +-------------+--------------+
> create table bucket2 like bucket1;
> insert overwrite table bucket2 select * from bucket1;
> select * from bucket2;
> +-------------+--------------+
> | bucket2.id | bucket2.val |
> +-------------+--------------+
> | 1 | abc |
> +-------------+--------------+
> {code}
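To see why two of the four buckets in the reproduction are empty, here is a small standalone sketch of the bucket routing (assuming Hive's usual scheme, where a primitive int key hashes to its own value and the bucket is the hash modulo the bucket count; the class and method names are illustrative):

```java
public class BucketDemo {
    // Bucket routing sketch: mask the sign bit, then take hash mod numBuckets.
    // For a primitive int key the hash is assumed to be the value itself.
    static int bucketFor(int key, int numBuckets) {
        return (key & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        for (int id : new int[] {1, 3}) {
            System.out.println("id=" + id + " -> bucket " + bucketFor(id, numBuckets));
        }
        // id=1 lands in bucket 1 and id=3 in bucket 3, so buckets 0 and 2
        // receive no rows -- exactly the empty-file case where the fileId
        // is never set and the downstream insert can lose rows.
    }
}
```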
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)