Marta Kuczora created HIVE-22969:
------------------------------------

             Summary: Union remove optimisation results in incorrect data when 
inserting to ACID table
                 Key: HIVE-22969
                 URL: https://issues.apache.org/jira/browse/HIVE-22969
             Project: Hive
          Issue Type: Bug
    Affects Versions: 4.0.0
            Reporter: Marta Kuczora
            Assignee: Marta Kuczora


Steps to reproduce the issue:
{noformat}
create table input_text(key string, val string) stored as textfile location 
'/Users/martakuczora/work/hive/warehouse/external/input_text';

create table output_acid(key string, val string) stored as orc 
tblproperties('transactional'='true');
insert into input_text values ('1','1'), ('2','2'),('3','3');
{noformat}
{noformat}
set hive.mapred.mode=nonstrict;
set hive.stats.autogather=false;
set hive.optimize.union.remove=true;
set hive.auto.convert.join=true;
set hive.exec.submitviachild=false;
set hive.exec.submit.local.task.via.child=false;

SELECT * FROM (
select key, val from input_text
union all
select a.key as key, b.val as val FROM input_text a join input_text b on 
a.key=b.key) c;

The result of the select:
1       1
2       2
3       3
1       1
2       2
3       3
{noformat}
{noformat}
insert into table output_acid
SELECT * FROM (
select key, val from input_text
union all
select a.key as key, b.val as val FROM input_text a join input_text b on 
a.key=b.key) c;

select * from output_acid;
The result:
1       1
2       2
3       3
{noformat}

The folder of the output_acid table contained the following delta directories:
{noformat}
drwxr-xr-x  6 martakuczora  staff  192 Mar  2 16:29 delta_0000000_0000000
drwxr-xr-x  6 martakuczora  staff  192 Mar  2 16:29 delta_0000001_0000001_0001
{noformat}
The statement ID is missing from the name of the first directory, so that 
directory is ignored when a select statement runs on the table. This is why 
the select on the output_acid table returned only half of the data.
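
To make the naming difference concrete, here is a minimal Python sketch that splits a delta directory name into its parts and flags names with no statement ID suffix. The pattern delta_<minWriteId>_<maxWriteId>[_<statementId>] is an assumption inferred from the two directory names in the listing above, and the helper names parse_delta and missing_statement_id are hypothetical. This is not Hive's own parsing logic, only an illustration for inspecting a table folder.

```python
import re

# Assumed naming pattern, inferred from the listing above:
#   delta_<minWriteId>_<maxWriteId>[_<statementId>]
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)(?:_(\d+))?$")


def parse_delta(name):
    """Return (min_write_id, max_write_id, statement_id) for a delta directory name.

    statement_id is None when the name has no statement ID suffix.
    Returns None when the name does not match the assumed pattern.
    """
    match = DELTA_RE.match(name)
    if match is None:
        return None
    min_id, max_id, stmt_id = match.groups()
    return int(min_id), int(max_id), None if stmt_id is None else int(stmt_id)


def missing_statement_id(names):
    """Return the delta directory names that carry no statement ID suffix."""
    flagged = []
    for name in names:
        parsed = parse_delta(name)
        if parsed is not None and parsed[2] is None:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Directory names copied from the listing in this report.
    dirs = ["delta_0000000_0000000", "delta_0000001_0000001_0001"]
    for d in dirs:
        print(d, parse_delta(d))
    print("no statement ID:", missing_statement_id(dirs))
```

For the two directories in this report, the sketch flags only delta_0000000_0000000, which matches the directory described above as being ignored.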

If either hive.stats.autogather is set to true or hive.optimize.union.remove is 
set to false, the result of the insert is correct. In that case the table's 
folder contains only one delta directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
