[GitHub] [iceberg] sweetpythoncode commented on issue #6306: Add ignoreDuplicates option for add_files procedure

via GitHub Tue, 14 Mar 2023 01:36:45 -0700


sweetpythoncode commented on issue #6306:
URL: https://github.com/apache/iceberg/issues/6306#issuecomment-1467640396


   @szehon-ho (cc @RussellSpitzer)
   Thanks for checking this. Here is the case I ran into:
   1. We have a Hive-style partitioned directory layout:
   ```
   s3://bucket/data/id=1/name=test/timestamp=20231101191213/result.orc
   s3://bucket/data/id=2/name=test-2/timestamp=20231101191213/result.orc
   ```
   2. The table is partitioned by `id`, `name`, and `timestamp`.
   3. We run the `add_files` procedure to register the Hive files in Iceberg:
   ```
   CALL iceberg_catalog.system.add_files(
   table => 'test.test_name',
   source_table => '`orc`.`s3://bucket/data/`'
   )
   ```
   4. After some time, an **uncontrolled** Hive system adds a new partition and removes the old one (the `timestamp` partition is needed to control the process version in other systems like Trino and avoid duplicates). Now we have:
   ```
   s3://bucket/data/id=1/name=test/timestamp=20231101191213/result.orc
   s3://bucket/data/id=2/name=test-2/timestamp=20231101191217/result.orc <- old timestamp removed, new one added
   ```
   5. I want to sync the Iceberg table so that it picks up the new partition and drops the one that was removed from the bucket.
   6. We run these steps to register that one changed partition; otherwise Iceberg fails with a "file does not exist" error:
   ```
   delete from test.test_name where 1=1;
   CALL iceberg_catalog.system.add_files(
        table => 'test.test_name',
        source_table => '`orc`.`s3://bucket/data/`',
        check_duplicate_files => false
   );
   CALL iceberg_catalog.system.expire_snapshots('test.test_name', TIMESTAMP '{now}', 1);
   ```
   _With the new feature, the use case could at least be changed to this:_
   6.
   ```
   -- delete the partitions that were removed from the bucket
   delete from test.test_name where timestamp = 20231101191213;
   -- register only the new partition instead of scanning the whole data again
   CALL iceberg_catalog.system.add_files(
        table => 'test.test_name',
        source_table => '`orc`.`s3://bucket/data/`',
        duplicate_file_mode => 'skip'
   );
   -- expire the old, removed partition from the snapshots
   CALL iceberg_catalog.system.expire_snapshots('test.test_name', TIMESTAMP '{now}', 1);
   ```
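
To make the delete in the proposed step 6 target only the partitions that actually disappeared, the two bucket listings can be compared first. The following is only a rough sketch in Python, not part of Iceberg: `partition_of`, `diff_partitions`, and `delete_predicate` are hypothetical helper names, the listings are assumed to be plain lists of object paths obtained elsewhere, and partition values are quoted as strings on the assumption that Spark casts them to the column types.

```python
from typing import Iterable, Set, Tuple

# A partition is an ordered tuple of (column, value) pairs,
# e.g. (("id", "2"), ("name", "test-2"), ("timestamp", "20231101191213")).
Partition = Tuple[Tuple[str, str], ...]


def partition_of(path: str, root: str) -> Partition:
    """Extract the Hive-style key=value segments between root and the file name."""
    relative = path[len(root):] if path.startswith(root) else path
    segments = relative.strip("/").split("/")[:-1]  # drop the file name
    return tuple(tuple(seg.split("=", 1)) for seg in segments if "=" in seg)


def diff_partitions(
    before: Iterable[str], after: Iterable[str], root: str
) -> Tuple[Set[Partition], Set[Partition]]:
    """Return (removed, added) partitions between two bucket listings."""
    old = {partition_of(p, root) for p in before}
    new = {partition_of(p, root) for p in after}
    return old - new, new - old


def delete_predicate(partition: Partition) -> str:
    """Build a WHERE clause body for one partition (values quoted as strings)."""
    return " AND ".join(f"{key} = '{value}'" for key, value in partition)


root = "s3://bucket/data/"
before = [
    "s3://bucket/data/id=1/name=test/timestamp=20231101191213/result.orc",
    "s3://bucket/data/id=2/name=test-2/timestamp=20231101191213/result.orc",
]
after = [
    "s3://bucket/data/id=1/name=test/timestamp=20231101191213/result.orc",
    "s3://bucket/data/id=2/name=test-2/timestamp=20231101191217/result.orc",
]

removed, added = diff_partitions(before, after, root)
predicates = sorted(delete_predicate(p) for p in removed)
```

Each entry in `predicates` could then be used as the `where` clause of the `delete` statement, so that the unchanged `id=1` partition, which shares the same `timestamp` value, is left alone.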
   ------------------
   Another possible workaround would be to pass the full path of the new partition to `add_files`, but I cannot do that because Iceberg ignores the partitions in the path and sets them to null in the table ([here is the link to the issue](https://github.com/apache/iceberg/issues/7027)):
   ```
   CALL iceberg_catalog.system.add_files(
        table => 'test.test_name',
        source_table => '`orc`.`s3://bucket/data/id=2/name=test-2/timestamp=20231101191217/`'
   )
   ```


