László Pintér created HIVE-26133:
------------------------------------

             Summary: Insert overwrite on Iceberg tables can result in 
duplicate entries after partition evolution
                 Key: HIVE-26133
                 URL: https://issues.apache.org/jira/browse/HIVE-26133
             Project: Hive
          Issue Type: Improvement
            Reporter: László Pintér
            Assignee: László Pintér


Insert overwrite commands in Hive only rewrite partitions affected by the query.

Suppose a record is written with specA (e.g. day(ts)), resulting in the data file:

/tableRoot/data/ts_day="2020-10-24"/ffffgggg.orc

If the partition spec is then changed to specB (e.g. day(ts), name), the same 
record would be written to a different partition:

/tableRoot/data/ts_day="2020-10-24"/name="Mike"/ffffgggg.orc

If the table is then overwritten with itself, Hive treats the old and the new 
copy of the record as belonging to different partitions (as they do), and 
therefore does not replace the original record with the new one, resulting in 
duplicate entries.
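
The mechanism can be illustrated with a small model. This is only a sketch of the behaviour described above, not Hive or Iceberg code: the table is modelled as a dict from partition key to rows, and spec_a, spec_b, insert, insert_overwrite and scan are hypothetical helpers invented for the illustration.

```python
from datetime import datetime

# Hypothetical partition specs: each maps a record to a partition key.
def spec_a(rec):
    # specA: day(ts)
    return (("ts_day", rec["ts"].date().isoformat()),)

def spec_b(rec):
    # specB: day(ts), name
    return (("ts_day", rec["ts"].date().isoformat()), ("name", rec["name"]))

def insert(table, spec, records):
    """Append records to the partitions they map to under `spec`."""
    for rec in records:
        table.setdefault(spec(rec), []).append(rec)

def insert_overwrite(table, spec, records):
    """Replace only the partitions that the new records map to under `spec`."""
    new_parts = {}
    for rec in records:
        new_parts.setdefault(spec(rec), []).append(rec)
    # Partitions whose key is not produced by the new records stay untouched.
    table.update(new_parts)

def scan(table):
    """Read every row from every partition."""
    return [rec for part in table.values() for rec in part]

table = {}
mike = {"ts": datetime(2020, 10, 24, 12, 0), "name": "Mike"}

insert(table, spec_a, [mike])          # record written under specA
rows = scan(table)                     # "select * from table"
insert_overwrite(table, spec_b, rows)  # overwrite the table with itself under specB

result = scan(table)
print(len(result))  # 2: the specA partition was never replaced
```

In this model the specA partition key never matches a key produced under specB, so the overwrite leaves the old partition in place and the record is read twice.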


{code:java}
create table testice1000 (a int, b string) stored by iceberg stored as orc 
location 'file:/tmp/testice1000';
insert into testice1000 values (11, 'ddd'), (22, 'ttt');
alter table testice1000 set partition spec(truncate(2, b));
insert into testice1000 values (33, 'rrfdfdf');
insert overwrite table testice1000 select * from testice1000;
+----------------+----------------+
| testice1000.a  | testice1000.b  |
+----------------+----------------+
| 11             | ddd            |
| 11             | ddd            |
| 22             | ttt            |
| 22             | ttt            |
| 33             | rrfdfdf        |
+----------------+----------------+
{code}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)
