[
https://issues.apache.org/jira/browse/HIVE-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marta Kuczora updated HIVE-22918:
---------------------------------
Description:
When we create an insert-only bucketed table with 5 buckets and insert only
one row into it, Hive creates empty files for the other 4 buckets. This
logic is in the code for ACID tables as well, but when checking the table's
final directory after the insert, I found that only 1 file got created. When
I debugged this issue, I found that the empty files are created in the staging
directory outside the delta directory, therefore they won't get copied by the
move task to the final directory. This behavior seems broken, but I am not
sure if we really need the empty files in this case.
This Jira is about investigating whether or not we need these empty files for
ACID tables and, if we do, fixing the code to have them for ACID tables as well.
Repro steps:
{noformat}
create table test_mm(key int, id int) clustered by (key) into 5 buckets stored
as orc tblproperties("transactional"="true",
"transactional_properties"="insert_only");
insert into test_mm values (1,1);
{noformat}
The following files are present in the 'test_mm/delta_0000001_0000001_0000'
folder:
{noformat}
244 Feb 21 12:08 000000_0
0 Feb 21 12:08 000001_0
0 Feb 21 12:08 000002_0
0 Feb 21 12:08 000003_0
0 Feb 21 12:08 000004_0
{noformat}
{noformat}
create table test_acid(key int, id int) clustered by (key) into 5 buckets
stored as orc tblproperties("transactional"="true");
insert into test_acid values (1,1);
{noformat}
The following files are present in the 'test_acid/delta_0000001_0000001_0000'
folder:
{noformat}
1 Feb 21 12:13 _orc_acid_version
656 Feb 21 12:13 bucket_00000
{noformat}
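To make the difference between the two listings explicit, here is a minimal sketch that reports which bucket ids have no file in a delta directory. The file name patterns are an assumption inferred only from the listings above (000001_0 for insert-only, bucket_00001 for full ACID), and the helpers bucket_id and missing_buckets are hypothetical names, not part of Hive.

```python
import re

# Patterns inferred only from the listings above, not from Hive's source:
# insert-only files look like 000001_0, full ACID files like bucket_00001.
_BUCKET_PATTERNS = (
    re.compile(r"^(\d{6})_\d+$"),
    re.compile(r"^bucket_(\d{5})$"),
)


def bucket_id(file_name):
    """Return the bucket number encoded in a file name, or None."""
    for pattern in _BUCKET_PATTERNS:
        match = pattern.match(file_name)
        if match:
            return int(match.group(1))
    return None


def missing_buckets(file_names, num_buckets):
    """List the bucket ids in range(num_buckets) that have no file."""
    present = {bucket_id(name) for name in file_names}
    return [b for b in range(num_buckets) if b not in present]


# File names copied from the two delta directory listings above.
mm_files = ["000000_0", "000001_0", "000002_0", "000003_0", "000004_0"]
acid_files = ["_orc_acid_version", "bucket_00000"]

print(missing_buckets(mm_files, 5))    # []
print(missing_buckets(acid_files, 5))  # [1, 2, 3, 4]
```

For the insert-only table every bucket has a file, while for the ACID table buckets 1 to 4 have none.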
However, when stopping in the MoveTask with the debugger, it can be seen that
the staging directory contains the empty files, so they are generated. Note
that 000000_0 is not a file; it is a directory which contains the delta
directory and the data file. When moving the data file to the final location,
the move task only moves the files from the delta directory, so the empty
files won't be moved.
{noformat}
[martakuczora:~/work/hive/warehouse/internal] % ll test_acid/.hive-staging_hive_2020-02-21_12-16-58_615_787573577176141305-1/-ext-10000
96 Feb 21 12:17 000000_0
0 Feb 21 12:17 000001_0
0 Feb 21 12:17 000002_0
0 Feb 21 12:17 000003_0
0 Feb 21 12:17 000004_0
{noformat}
{noformat}
[martakuczora:~/work/hive/warehouse/internal] % ll test_acid/.hive-staging_hive_2020-02-21_12-16-58_615_787573577176141305-1/-ext-10000/000000_0/delta_0000001_0000001_0000
1 Feb 21 12:17 _orc_acid_version
656 Feb 21 12:17 bucket_00000
{noformat}
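The effect described above can be reproduced outside Hive with a small simulation. This is only a sketch under stated assumptions: build_staging recreates the staging layout from the listings with placeholder bytes, and move_delta_dirs is a hypothetical stand-in that moves only delta_* directories. It is not the actual MoveTask code.

```python
import shutil
import tempfile
from pathlib import Path

DELTA = "delta_0000001_0000001_0000"


def build_staging(staging, num_buckets=5):
    """Recreate the staging layout shown in the listings above."""
    # 000000_0 is a directory that wraps the delta directory.
    delta = staging / "000000_0" / DELTA
    delta.mkdir(parents=True)
    # Placeholder bytes that only match the listed sizes (1 and 656).
    (delta / "_orc_acid_version").write_bytes(b"x")
    (delta / "bucket_00000").write_bytes(b"x" * 656)
    # The remaining buckets are empty files next to that directory.
    for b in range(1, num_buckets):
        (staging / f"{b:06d}_0").touch()


def move_delta_dirs(staging, table_dir):
    """Hypothetical stand-in for the move step: only delta_* directories
    one level below the staging directory are moved; loose files stay."""
    table_dir.mkdir(parents=True, exist_ok=True)
    for delta in sorted(staging.glob("*/delta_*")):
        shutil.move(str(delta), str(table_dir / delta.name))


def run_demo():
    """Return (files in the final delta dir, files left in staging)."""
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "staging" / "-ext-10000"
        table_dir = Path(tmp) / "test_acid"
        build_staging(staging)
        move_delta_dirs(staging, table_dir)
        moved = sorted(p.name for p in (table_dir / DELTA).iterdir())
        left_behind = sorted(p.name for p in staging.iterdir() if p.is_file())
        return moved, left_behind


moved, left_behind = run_demo()
print(moved)        # ['_orc_acid_version', 'bucket_00000']
print(left_behind)  # ['000001_0', '000002_0', '000003_0', '000004_0']
```

In this simulation the four empty bucket files never reach the table directory because they sit beside the 000000_0 directory rather than inside the delta directory.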
> Investigate empty bucket file creation for ACID tables
> ------------------------------------------------------
>
> Key: HIVE-22918
> URL: https://issues.apache.org/jira/browse/HIVE-22918
> Project: Hive
> Issue Type: Task
> Affects Versions: 4.0.0
> Reporter: Marta Kuczora
> Assignee: Marton Bod
> Priority: Major
>