[jira] [Work logged] (HIVE-26133) Insert overwrite on Iceberg tables can result in duplicate entries after partition evolution

ASF GitHub Bot (Jira) Tue, 12 Apr 2022 06:47:07 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-26133?focusedWorklogId=755759&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-755759
 ]


ASF GitHub Bot logged work on HIVE-26133:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/22 13:46
            Start Date: 12/Apr/22 13:46
    Worklog Time Spent: 10m 
      Work Description: lcspinter commented on code in PR #3202:
URL: https://github.com/apache/hive/pull/3202#discussion_r848456955


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -460,6 +461,13 @@ public void validateSinkDesc(FileSinkDesc sinkDesc) throws SemanticException {
       if (IcebergTableUtil.isBucketed(table)) {
         throw new SemanticException("Cannot perform insert overwrite query on bucket partitioned Iceberg table.");
       }
+      if (table.currentSnapshot() != null) {
+        if (table.currentSnapshot().allManifests().parallelStream().map(ManifestFile::partitionSpecId)
+            .filter(id -> id < table.spec().specId()).findAny().isPresent()) {
+          throw new SemanticException(
+              "Cannot perform insert overwrite query on Iceberg table where partition evolution happened.");

Review Comment:
   I didn't add any recommendations because I don't know how we can enforce 
the rewrite of the data. Do we have an example query? 





Issue Time Tracking
-------------------

    Worklog Id:     (was: 755759)
    Time Spent: 1h 20m  (was: 1h 10m)

> Insert overwrite on Iceberg tables can result in duplicate entries after 
> partition evolution
> --------------------------------------------------------------------------------------------
>
>                 Key: HIVE-26133
>                 URL: https://issues.apache.org/jira/browse/HIVE-26133
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Pintér
>            Assignee: László Pintér
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Insert overwrite commands in Hive only rewrite the partitions affected by 
> the query.
> If we write out a record with specA (e.g. day(ts)), the result is a data file:
> /tableRoot/data/ts_day="2020-10-24"/ffffgggg.orc
> If we then change to specB (e.g. day(ts), name), the same record would go to 
> a different partition:
> /tableRoot/data/ts_day="2020-10-24"/name="Mike"/ffffgggg.orc
> If we then overwrite the table with itself, the overwrite treats these two 
> records as belonging to different partitions (as they do) and therefore does 
> not replace the original record with the new one, resulting in duplicate 
> entries.
> {code:java}
> create table testice1000 (a int, b string) stored by iceberg stored as orc 
> location 'file:/tmp/testice1000';
> insert into testice1000 values (11, 'ddd'), (22, 'ttt');
> alter table testice1000 set partition spec(truncate(2, b));
> insert into testice1000 values (33, 'rrfdfdf');
> insert overwrite table testice1000 select * from testice1000;
> ------------------------------+
> testice1000.a testice1000.b
> ------------------------------+
> 11 ddd   
> 11 ddd   
> 22 ttt   
> 22 ttt   
> 33 rrfdfdf
> ------------------------------+
> {code}
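
The duplication described above can be reproduced with a small, self-contained model. This is only a sketch under stated assumptions, not Hive's or Iceberg's actual code path: the class `PartitionOverwriteSketch`, the `DataFile` type, and the helpers `specA`, `specB`, and `overwriteWithSelf` are all hypothetical names, the original spec is assumed to be unpartitioned (as in the reproduction), and `specB` only imitates truncate(2, b) on a string.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionOverwriteSketch {

  // Hypothetical model of a data file: one row plus the partition key
  // computed by whichever spec was current when the file was written.
  public static final class DataFile {
    public final String partitionKey;
    public final int a;
    public final String b;

    public DataFile(String partitionKey, int a, String b) {
      this.partitionKey = partitionKey;
      this.a = a;
      this.b = b;
    }

    @Override
    public String toString() {
      return a + " " + b + " [" + partitionKey + "]";
    }
  }

  // Assumed old spec: unpartitioned, so every row maps to the same empty key.
  static String specA(String b) {
    return "";
  }

  // Assumed new spec: imitates truncate(2, b).
  static String specB(String b) {
    return "b_trunc=" + b.substring(0, Math.min(2, b.length()));
  }

  // Models "insert overwrite table t select * from t" when the overwrite
  // only replaces partitions (under the current spec) that the query touched.
  static List<DataFile> overwriteWithSelf(List<DataFile> table) {
    List<DataFile> rewritten = new ArrayList<>();
    Set<String> touched = new HashSet<>();
    for (DataFile f : table) {
      String key = specB(f.b);
      rewritten.add(new DataFile(key, f.a, f.b));
      touched.add(key);
    }
    List<DataFile> result = new ArrayList<>();
    for (DataFile f : table) {
      // Files written under the old spec carry a key that no new write
      // touches, so they survive the overwrite.
      if (!touched.contains(f.partitionKey)) {
        result.add(f);
      }
    }
    result.addAll(rewritten);
    return result;
  }

  // Mirrors the reproduction: two rows under the old spec, one under the new.
  static List<DataFile> example() {
    List<DataFile> table = new ArrayList<>();
    table.add(new DataFile(specA("ddd"), 11, "ddd"));
    table.add(new DataFile(specA("ttt"), 22, "ttt"));
    table.add(new DataFile(specB("rrfdfdf"), 33, "rrfdfdf"));
    return overwriteWithSelf(table);
  }

  public static void main(String[] args) {
    example().forEach(System.out::println);
  }
}
```

In this model the rows written before the spec change survive next to their rewritten copies, while the row written after the change is replaced exactly once, which matches the row counts in the query output above.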



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
