[
https://issues.apache.org/jira/browse/HIVE-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marta Kuczora resolved HIVE-23763.
----------------------------------
Resolution: Fixed
> Query based minor compaction produces wrong files when rows with different
> bucket Ids are processed by the same FileSinkOperator
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Marta Kuczora
> Assignee: Marta Kuczora
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> How to reproduce (a scripted sketch of these steps follows below):
> - Create an unbucketed ACID table
> - Insert a larger amount of data into this table so that multiple bucket
> files are created in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00000_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00001_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00002_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00003_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00004_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00005_0
> - Delete some rows with different bucket Ids
> The resulting files in the delete deltas should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000002_0000002_0000/bucket_00000
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00003
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00001
> - Run the query-based minor compaction
> - After the compaction, the newly created delete delta contains only one
> bucket file. This file contains rows from all buckets and the table becomes
> unusable:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000001_0000007_v0000066/bucket_00000
> The issue happens only if rows with different bucket Ids are processed by the
> same FileSinkOperator.
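> A minimal sketch of the steps above over JDBC. The table name bubu_acid comes
> from the paths in the description; the connection URL, column layout and row
> values are illustrative, and query-based compaction is assumed to be enabled
> (e.g. hive.compactor.crud.query.based=true):
> {noformat}
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.Statement;
>
> public class Hive23763Repro {
>   public static void main(String[] args) throws Exception {
>     try (Connection con =
>             DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
>          Statement stmt = con.createStatement()) {
>       // Unbucketed full-ACID table
>       stmt.execute("CREATE TABLE bubu_acid (id INT, val STRING) STORED AS ORC"
>           + " TBLPROPERTIES ('transactional'='true')");
>       // In practice a much larger, parallel insert is needed here, so that
>       // several writer tasks each create their own bucket_0000N file
>       stmt.execute("INSERT INTO bubu_acid VALUES (1, 'a'), (2, 'b'), (3, 'c')");
>       // Delete rows that live in different bucket files (different bucket Ids)
>       stmt.execute("DELETE FROM bubu_acid WHERE id IN (1, 2, 3)");
>       // Trigger the minor compaction
>       stmt.execute("ALTER TABLE bubu_acid COMPACT 'minor'");
>     }
>   }
> }
> {noformat}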
> In the FileSinkOperator.process method, the files for the compaction table
> are created like this:
> {noformat}
>     if (!bDynParts && !filesCreated) {
>       if (lbDirName != null) {
>         if (valToPaths.get(lbDirName) == null) {
>           createNewPaths(null, lbDirName);
>         }
>       } else {
>         if (conf.isCompactionTable()) {
>           int bucketProperty = getBucketProperty(row);
>           bucketId = BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
>         }
>         createBucketFiles(fsp);
>       }
>     }
> {noformat}
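> For reference, the bucket property packs the codec version, the writer
> (bucket) id and the statement id into a single int. A standalone sketch of
> what BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty)
> computes, assuming the V1 layout (version in the top 3 bits, writer id in
> bits 16-27, statement id in the low 12 bits):
> {noformat}
> public class BucketPropertySketch {
>
>   // Pack a writer id and statement id into a V1 bucket property
>   static int encodeV1(int writerId, int statementId) {
>     return (1 << 29) | (writerId << 16) | statementId;
>   }
>
>   // Recover the writer (bucket) id: shift down and mask the 12 writer-id bits
>   static int decodeWriterId(int bucketProperty) {
>     return (bucketProperty >>> 16) & 0xFFF;
>   }
>
>   public static void main(String[] args) {
>     int prop = encodeV1(3, 0);                 // a row belonging to bucket 3
>     System.out.println(decodeWriterId(prop));  // prints 3
>   }
> }
> {noformat}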
> When the first row is processed, the bucket file is created and the
> filesCreated flag is set to true. When the subsequent rows are processed, the
> outer if condition is false, so no new file is created; each row is written
> into the file that was created for the first row, regardless of its bucket Id.
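> In other words, the single filesCreated flag gates file creation for the
> whole operator, while the compaction case would need per-bucket bookkeeping.
> A condensed, hypothetical model of the two shapes (not the actual HIVE-23763
> patch; Hive23763SinkModel and createBucketFile are made up for illustration):
> {noformat}
> import java.util.HashSet;
> import java.util.Set;
>
> class Hive23763SinkModel {
>   private boolean filesCreated = false;
>   private final Set<Integer> bucketsWithFiles = new HashSet<>();
>
>   // Buggy shape: the first row creates one file, every later row reuses it,
>   // even when its bucket id differs
>   void processBuggy(int bucketId) {
>     if (!filesCreated) {
>       createBucketFile(bucketId);
>       filesCreated = true;
>     }
>   }
>
>   // Per-bucket shape: create a file the first time each bucket id is seen
>   void processPerBucket(int bucketId) {
>     if (bucketsWithFiles.add(bucketId)) {
>       createBucketFile(bucketId);
>     }
>   }
>
>   private void createBucketFile(int bucketId) {
>     System.out.printf("creating bucket_%05d%n", bucketId);
>   }
> }
> {noformat}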
--
This message was sent by Atlassian Jira
(v8.3.4#803005)