sweetpythoncode commented on issue #6306:
URL: https://github.com/apache/iceberg/issues/6306#issuecomment-1467640396
@szehon-ho (cc @RussellSpitzer)
Thanks for checking this. Here is the case I ran into:
1. We have a Hive-style directory layout with partitions:
```
s3://bucket/data/id=1/name=test/timestamp=20231101191213/result.orc
s3://bucket/data/id=2/name=test-2/timestamp=20231101191213/result.orc
```
2. The table is partitioned by `id`, `name`, and `timestamp`.
3. We run the `add_files` procedure to import the Hive files into Iceberg:
```
CALL iceberg_catalog.system.add_files(
table => 'test.test_name',
source_table => '`orc`.`s3://bucket/data/`'
)
```
4. After some time, an **uncontrolled** Hive system adds a new partition and
removes the old one (the `timestamp` partition is needed to track the process
version in other systems like Trino and avoid duplicates). Now we have:
```
s3://bucket/data/id=1/name=test/timestamp=20231101191213/result.orc
s3://bucket/data/id=2/name=test-2/timestamp=20231101191217/result.orc <- old timestamp removed, new one added
```
5. I want to sync the Iceberg table so that it picks up the new partition and
drops the partition that was removed from the bucket.
6. We run these steps to register that one changed partition; otherwise
Iceberg fails with a "file does not exist" error:
```
delete from test.test_name where 1=1;
CALL iceberg_catalog.system.add_files(
table => 'test.test_name',
source_table => '`orc`.`s3://bucket/data/`',
check_duplicate_files => false
)
CALL iceberg_catalog.system.expire_snapshots('test.test_name', TIMESTAMP '{now}', 1)
```
_With the new feature, step 6 could at least be reduced to:_
```
delete from test.test_name where timestamp = 20231101191213; <- removed partitions from bucket
CALL iceberg_catalog.system.add_files(
table => 'test.test_name',
source_table => '`orc`.`s3://bucket/data/`',
duplicate_file_mode => 'skip'
) <- here we register only the new partition, instead of scanning the whole data again.
CALL iceberg_catalog.system.expire_snapshots('test.test_name', TIMESTAMP '{now}', 1) <- to expire old removed partition from snapshots
```
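To make the sync in steps 5 and 6 concrete, here is a rough Python sketch of how the removed and added partitions could be worked out from two file listings. The function names `partition_dirs` and `diff_partitions` are hypothetical helpers, not part of Iceberg or Spark, and the sketch assumes you already have the list of files registered in the table and a fresh listing of the bucket.

```python
def partition_dirs(file_paths):
    """Return the set of partition directories, i.e. the parent of each data file."""
    return {path.rsplit("/", 1)[0] for path in file_paths}


def diff_partitions(registered_files, current_files):
    """Compare two file listings and return (removed, added) partition directories.

    registered_files: paths the table currently references (assumed input)
    current_files:    paths that exist in the bucket right now (assumed input)
    """
    registered = partition_dirs(registered_files)
    current = partition_dirs(current_files)
    removed = sorted(registered - current)  # partitions to delete from the table
    added = sorted(current - registered)    # partitions to register with add_files
    return removed, added


registered = [
    "s3://bucket/data/id=1/name=test/timestamp=20231101191213/result.orc",
    "s3://bucket/data/id=2/name=test-2/timestamp=20231101191213/result.orc",
]
current = [
    "s3://bucket/data/id=1/name=test/timestamp=20231101191213/result.orc",
    "s3://bucket/data/id=2/name=test-2/timestamp=20231101191217/result.orc",
]
removed, added = diff_partitions(registered, current)
```

With the listings from steps 1 and 4, `removed` holds the old `id=2` partition directory and `added` holds the new one, which is exactly the pair that the delete and `add_files` calls need to handle.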
------------------
Another workaround would be to pass the full path of the new partition to
`add_files`, but I cannot do that because Iceberg ignores the partition values
in the path and sets them to null in the table ([see this
issue](https://github.com/apache/iceberg/issues/7027)):
```
CALL iceberg_catalog.system.add_files(
table => 'test.test_name',
source_table =>
'`orc`.`s3://bucket/data/id=2/name=test-2/timestamp=20231101191217/`'
)
```
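For reference, these are the partition values that the path above encodes and that I would expect to end up in the table. The snippet is only an illustrative Python sketch of Hive-style `key=value` path parsing; `parse_partition_values` is a hypothetical helper and says nothing about how Iceberg resolves partitions internally.

```python
def parse_partition_values(path, table_root):
    """Extract Hive-style key=value partition segments from a path under table_root."""
    if not path.startswith(table_root):
        raise ValueError(f"{path!r} is not under {table_root!r}")
    relative = path[len(table_root):].strip("/")
    values = {}
    for segment in relative.split("/"):
        # Only directory names of the form key=value carry partition information.
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


values = parse_partition_values(
    "s3://bucket/data/id=2/name=test-2/timestamp=20231101191217/",
    "s3://bucket/data/",
)
```

Here `values` maps `id`, `name`, and `timestamp` to the strings taken from the directory names, whereas the call above leaves those columns null.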