[ https://issues.apache.org/jira/browse/HIVE-22969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marta Kuczora reassigned HIVE-22969:
------------------------------------
Assignee: (was: Marta Kuczora)
> Union remove optimisation results in incorrect data when inserting to ACID table
> --------------------------------------------------------------------------------
>
> Key: HIVE-22969
> URL: https://issues.apache.org/jira/browse/HIVE-22969
> Project: Hive
> Issue Type: Bug
> Affects Versions: 4.0.0
> Reporter: Marta Kuczora
> Priority: Major
>
> Steps to reproduce the issue:
> {noformat}
> create table input_text(key string, val string) stored as textfile location
> '/Users/martakuczora/work/hive/warehouse/external/input_text';
> create table output_acid(key string, val string) stored as orc
> tblproperties('transactional'='true');
> insert into input_text values ('1','1'), ('2','2'),('3','3');
> {noformat}
> {noformat}
> set hive.mapred.mode=nonstrict;
> set hive.stats.autogather=false;
> set hive.optimize.union.remove=true;
> set hive.auto.convert.join=true;
> set hive.exec.submitviachild=false;
> set hive.exec.submit.local.task.via.child=false;
> SELECT * FROM (
> select key, val from input_text
> union all
> select a.key as key, b.val as val FROM input_text a join input_text b on
> a.key=b.key) c;
> The result of the select:
> 1 1
> 2 2
> 3 3
> 1 1
> 2 2
> 3 3
> {noformat}
> {noformat}
> insert into table output_acid
> SELECT * FROM (
> select key, val from input_text
> union all
> select a.key as key, b.val as val FROM input_text a join input_text b on
> a.key=b.key) c;
> select * from output_acid;
> The result:
> 1 1
> 2 2
> 3 3
> {noformat}
> The folder of the output_acid table contained the following delta directories:
> {noformat}
> drwxr-xr-x 6 martakuczora staff 192 Mar 2 16:29 delta_0000000_0000000
> drwxr-xr-x 6 martakuczora staff 192 Mar 2 16:29 delta_0000001_0000001_0001
> {noformat}
> The first directory name is missing the statement ID suffix, so when a select
> statement runs on the table, that directory is ignored. That is why only half
> of the data was returned when running the select on the output_acid table.
> If either hive.stats.autogather is set to true or hive.optimize.union.remove
> is set to false, the result of the insert is correct. In that case there is
> only one delta directory in the table's folder.
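> A possible session-level workaround, sketched only from the observation above
> and not verified beyond this reproduction, is to turn the union remove
> optimisation off before running the insert. The sketch assumes the same
> session and tables as in the steps to reproduce, with output_acid freshly
> created and still empty:
> {noformat}
> -- Assumption: disabling the optimisation avoids the delta directory
> -- without a statement ID, as described above.
> set hive.optimize.union.remove=false;
> insert into table output_acid
> SELECT * FROM (
> select key, val from input_text
> union all
> select a.key as key, b.val as val FROM input_text a join input_text b on
> a.key=b.key) c;
> -- On an empty output_acid this should return the same six rows as the plain select.
> select * from output_acid;
> {noformat}
> Setting hive.stats.autogather=true instead should have the same effect
> according to the description above. Either setting only sidesteps the problem
> and does not fix the union remove optimisation itself.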