szehon-ho opened a new issue #3086: URL: https://github.com/apache/iceberg/issues/3086
The current "write.object-storage.path" property is a good solution for throttling on object stores like S3. However, we would still like to use the built-in 'removeOrphan' functionality for tracking storage costs and privacy.

Remove Orphan relies on an S3 listing under the table location to find the list of potential orphan files. Using the recommended 'write.object-storage.path' value, like 's3://bucket/', breaks removeOrphans, because the resulting orphan data files (for example 's3://bucket/xxxxxxxx/db/table/data/part=a/0000.parquet') are not listed under the table location.

The only solution seems to be to set write.object-storage.path to the table location manually. This still reduces throttling (the object store prefixes become each table location plus some characters of the hash), but it is not very user friendly:

- The path becomes 's3://bucket/db/table/xxxxxxxx/db/table/p=a/0000.parquet'. In particular, 'db/table' is repeated, adding unnecessarily to the path length. It is also a bit awkward in general.
- From a UX perspective, the user needs to be cognizant of the table location when setting this property (the table location is optional in any create table statement, and many users do not set it).

It would be great to have a new property like 'write.object-storage.table-location' which, when set to true:

- sets the path to the more reasonable and efficient 's3://bucket/db/table/xxxxxxxx/p=a/0000.parquet' (removing /data/db/table/ just as /data was removed by the original implementation)
- makes the path automatically follow the default table location, without the user having to specify it. Changing the table location will automatically change the write object storage path.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
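The listing problem described in the issue can be illustrated with a small sketch. It is not Iceberg code: the helper `is_listed_under` is hypothetical, the table location 's3://bucket/db/table' is an assumption inferred from the example paths, and 'xxxxxxxx' is kept as a literal placeholder for the hash component. The sketch only checks which of the three path layouts a prefix listing of the table location would return.

```python
def is_listed_under(table_location: str, file_path: str) -> bool:
    """Return True if a prefix listing of table_location would include file_path.

    Hypothetical helper: models an object-store listing as a plain
    string-prefix match, which is how S3-style listings behave.
    """
    prefix = table_location.rstrip("/") + "/"
    return file_path.startswith(prefix)


# Assumed table location, inferred from the example paths in the issue.
TABLE_LOCATION = "s3://bucket/db/table"

# Paths copied from the issue text; 'xxxxxxxx' stands for the hash component.
paths = {
    "object-storage path = bucket root":
        "s3://bucket/xxxxxxxx/db/table/data/part=a/0000.parquet",
    "object-storage path = table location":
        "s3://bucket/db/table/xxxxxxxx/db/table/p=a/0000.parquet",
    "proposed table-location option":
        "s3://bucket/db/table/xxxxxxxx/p=a/0000.parquet",
}

results = {
    label: is_listed_under(TABLE_LOCATION, path) for label, path in paths.items()
}

for label, visible in results.items():
    status = "visible to orphan listing" if visible else "missed by orphan listing"
    print(f"{label}: {status}")
```

Under these assumptions, only the bucket-root layout falls outside the table location; the manual workaround and the proposed layout both stay under it, with the proposed layout avoiding the repeated 'db/table' segment.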
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
