[
https://issues.apache.org/jira/browse/HIVE-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marton Bod resolved HIVE-22918.
-------------------------------
Resolution: Done
> Investigate empty bucket file creation for ACID tables
> ------------------------------------------------------
>
> Key: HIVE-22918
> URL: https://issues.apache.org/jira/browse/HIVE-22918
> Project: Hive
> Issue Type: Task
> Affects Versions: 4.0.0
> Reporter: Marta Kuczora
> Assignee: Marton Bod
> Priority: Major
>
> When we create an insert-only bucketed table with 5 buckets and insert only
> one row into it, Hive creates empty files for the other 4 buckets.
> The same logic exists in the code for ACID tables, but when I checked the
> table's final directory after the insert, I found that only 1 file had been
> created. While debugging this, I found that the empty files are created
> in the staging directory outside the delta directory, so the move task does
> not copy them to the final directory. This behavior seems broken, but I am
> not sure whether we really need the empty files in this case.
> This Jira is about investigating whether we need these empty files for
> ACID tables and, if we do, fixing the code so ACID tables get them as well.
> Repro steps:
> {noformat}
> create table test_mm(key int, id int) clustered by (key) into 5 buckets
> stored as orc tblproperties("transactional"="true",
> "transactional_properties"="insert_only");
> insert into test_mm values (1,1);
> {noformat}
> The following files are present in the 'test_mm/delta_0000001_0000001_0000'
> folder:
> {noformat}
> 244 Feb 21 12:08 000000_0
> 0 Feb 21 12:08 000001_0
> 0 Feb 21 12:08 000002_0
> 0 Feb 21 12:08 000003_0
> 0 Feb 21 12:08 000004_0
> {noformat}
> {noformat}
> create table test_acid(key int, id int) clustered by (key) into 5 buckets
> stored as orc tblproperties("transactional"="true");
> insert into test_acid values (1,1);
> {noformat}
> The following files are present in the 'test_acid/delta_0000001_0000001_0000'
> folder:
> {noformat}
> 1 Feb 21 12:13 _orc_acid_version
> 656 Feb 21 12:13 bucket_00000
> {noformat}
> However, when stopping in the MoveTask with the debugger, it can be seen that
> the staging directory contains the empty files, so they are generated.
> Note that 000000_0 is not a file here; it is a directory that contains the
> delta directory and the data file. When moving the data file to the final
> location, the move task only moves the files from the delta directory, so
> the empty files are not moved.
> {noformat}
> ll test_acid/.hive-staging_hive_2020-02-21_12-16-58_615_787573577176141305-1/-ext-10000
>
> 96 Feb 21 12:17 000000_0
> 0 Feb 21 12:17 000001_0
> 0 Feb 21 12:17 000002_0
> 0 Feb 21 12:17 000003_0
> 0 Feb 21 12:17 000004_0
> {noformat}
> {noformat}
> ll test_acid/.hive-staging_hive_2020-02-21_12-16-58_615_787573577176141305-1/-ext-10000/000000_0/delta_0000001_0000001_0000
>
> 1 Feb 21 12:17 _orc_acid_version
> 656 Feb 21 12:17 bucket_00000
> {noformat}
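
The move behavior described in the report can be illustrated with a small, self-contained simulation. This is only a sketch, not Hive's MoveTask: it recreates the staging layout from the listings above in a temporary directory and applies a hypothetical `move_delta_files` helper that moves only the files found inside delta directories. The function names, the placeholder file contents, and the `staging` directory name are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def build_staging(root: Path, num_buckets: int = 5) -> Path:
    """Recreate the staging layout from the report (file contents are placeholders)."""
    staging = root / "staging" / "-ext-10000"
    # In the ACID case, 000000_0 is a directory that holds the delta directory.
    delta = staging / "000000_0" / "delta_0000001_0000001_0000"
    delta.mkdir(parents=True)
    (delta / "_orc_acid_version").write_bytes(b"x")
    (delta / "bucket_00000").write_bytes(b"placeholder row data")
    # The remaining buckets are empty files that sit next to 000000_0.
    for bucket in range(1, num_buckets):
        (staging / f"{bucket:06d}_0").touch()
    return staging


def move_delta_files(staging: Path, table_dir: Path) -> None:
    """Hypothetical stand-in for the move step: only files inside delta dirs are moved."""
    for delta in staging.glob("*/delta_*"):
        target = table_dir / delta.name
        target.mkdir(parents=True, exist_ok=True)
        for entry in delta.iterdir():
            if entry.is_file():
                shutil.move(str(entry), str(target / entry.name))


def run_demo():
    """Build the layout, run the move, and report what moved and what stayed behind."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        staging = build_staging(root)
        table_dir = root / "test_acid"
        move_delta_files(staging, table_dir)
        final_delta = table_dir / "delta_0000001_0000001_0000"
        moved = sorted(p.name for p in final_delta.iterdir())
        left_behind = sorted(p.name for p in staging.iterdir() if p.is_file())
        return moved, left_behind


if __name__ == "__main__":
    moved, left_behind = run_demo()
    print("moved:", moved)
    print("left in staging:", left_behind)
```

Under these assumptions, the final delta directory ends up with only `_orc_acid_version` and `bucket_00000`, while the four empty bucket files stay in the staging directory, which matches the listings in the report.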