[ https://issues.apache.org/jira/browse/HIVE-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marton Bod resolved HIVE-22918.
-------------------------------
    Resolution: Done

> Investigate empty bucket file creation for ACID tables
> ------------------------------------------------------
>
>                 Key: HIVE-22918
>                 URL: https://issues.apache.org/jira/browse/HIVE-22918
>             Project: Hive
>          Issue Type: Task
>    Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Assignee: Marton Bod
>            Priority: Major
>
> When we create an insert-only bucketed table with 5 buckets and insert 
> only one row into it, Hive creates empty files for the other 4 buckets. 
> The same logic is in the code for ACID tables as well, but when checking the 
> table's final directory after the insert, I found that only 1 file was 
> created. While debugging this issue, I found that the empty files are created 
> in the staging directory, outside the delta directory, so the move task does 
> not copy them to the final directory. This behavior seems 
> broken, but I am not sure whether we really need the empty files in this case.
> This Jira is about investigating whether we need these empty files for 
> ACID tables and, if we do, fixing the code so that ACID tables get them as well.
> Repro steps: 
> {noformat}
> create table test_mm(key int, id int) clustered by (key) into 5 buckets 
> stored as orc tblproperties("transactional"="true", 
> "transactional_properties"="insert_only");
> insert into test_mm values (1,1);
> {noformat}
> The following files are present in the 'test_mm/delta_0000001_0000001_0000' 
> folder:
> {noformat}
> 244 Feb 21 12:08 000000_0
>   0 Feb 21 12:08 000001_0
>   0 Feb 21 12:08 000002_0
>   0 Feb 21 12:08 000003_0
>   0 Feb 21 12:08 000004_0
> {noformat}
> {noformat}
> create table test_acid(key int, id int) clustered by (key) into 5 buckets 
> stored as orc tblproperties("transactional"="true");
> insert into test_acid values (1,1);
> {noformat}
> The following files are present in the 'test_acid/delta_0000001_0000001_0000' 
> folder:
> {noformat}
>   1 Feb 21 12:13 _orc_acid_version
> 656 Feb 21 12:13 bucket_00000
> {noformat}
> However, stopping in the MoveTask with the debugger shows that 
> the staging directory does contain the empty files, so they are generated. 
> Note that 000000_0 is not a file but a directory, which contains the 
> delta directory and, inside it, the data file. When moving the data to the final 
> location, the move task only moves the files from the delta directory, so 
> the empty files are not moved.
> {noformat}
> ll test_acid/.hive-staging_hive_2020-02-21_12-16-58_615_787573577176141305-1/-ext-10000
>  
> 96 Feb 21 12:17 000000_0
> 0 Feb 21 12:17 000001_0
> 0 Feb 21 12:17 000002_0
> 0 Feb 21 12:17 000003_0
> 0 Feb 21 12:17 000004_0
> {noformat}
> {noformat}
> ll test_acid/.hive-staging_hive_2020-02-21_12-16-58_615_787573577176141305-1/-ext-10000/000000_0/delta_0000001_0000001_0000
>  
>   1 Feb 21 12:17 _orc_acid_version
> 656 Feb 21 12:17 bucket_00000
> {noformat}
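
The behavior described above can be illustrated outside Hive with a small filesystem simulation. This is only a sketch under stated assumptions: it is not Hive's MoveTask code, the function names (build_staging, move_delta_dirs, simulate) are hypothetical, and the file contents are dummies. It recreates the staging layout from the listings and applies a move step that, as the description reports, only picks up delta directories.

```python
import os
import shutil
import tempfile


def build_staging(root):
    """Recreate the staging layout from the listings (file contents are dummies)."""
    ext = os.path.join(root, "staging", "-ext-10000")
    delta = os.path.join(ext, "000000_0", "delta_0000001_0000001_0000")
    os.makedirs(delta)
    # Files written for the one bucket that received a row.
    for name in ("_orc_acid_version", "bucket_00000"):
        with open(os.path.join(delta, name), "w") as f:
            f.write("x")
    # Empty placeholder files for the other four buckets.
    for i in range(1, 5):
        open(os.path.join(ext, "%06d_0" % i), "w").close()
    return ext


def move_delta_dirs(ext, table_dir):
    """Hypothetical stand-in for the move step: only delta_* directories are moved."""
    os.makedirs(table_dir, exist_ok=True)
    for entry in sorted(os.listdir(ext)):
        path = os.path.join(ext, entry)
        if not os.path.isdir(path):
            continue  # top-level empty bucket files are never considered
        for child in sorted(os.listdir(path)):
            if child.startswith("delta_"):
                shutil.move(os.path.join(path, child), os.path.join(table_dir, child))


def simulate():
    """Run the move on a temporary tree; return (moved files, files left in staging)."""
    root = tempfile.mkdtemp()
    try:
        ext = build_staging(root)
        table_dir = os.path.join(root, "test_acid")
        move_delta_dirs(ext, table_dir)
        delta = os.path.join(table_dir, "delta_0000001_0000001_0000")
        moved = sorted(os.listdir(delta))
        left_behind = sorted(
            e for e in os.listdir(ext) if os.path.isfile(os.path.join(ext, e))
        )
        return moved, left_behind
    finally:
        shutil.rmtree(root)
```

Calling simulate() returns the two files that end up in the table's delta directory and the four empty bucket files that stay in the staging directory, matching the listings in the description.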



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
