szehon-ho opened a new issue #3086:
URL: https://github.com/apache/iceberg/issues/3086


   The current 'write.object-storage.path' property is a good way to avoid 
throttling on object stores like S3.
   
   However, we would still like to use the built-in 'removeOrphan' 
functionality for storage cost and privacy reasons.  Remove Orphan relies on an 
S3 listing under the table location to find the list of potential orphan files. 
Using the recommended 'write.object-storage.path' like 's3://bucket/' breaks 
removeOrphans, because the resulting orphan data files, e.g. 
's3://bucket/xxxxxxxx/db/table/data/part=a/0000.parquet', are not listed under 
the table location.
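
   To make the failure concrete, here is a minimal sketch of the prefix check 
that a listing-based orphan scan implies. It is not Iceberg code: `is_under` 
is a hypothetical helper, and the "regular" path is an assumed example of a 
file written without the object-storage layout.

```python
# Hypothetical illustration only; this is not how Iceberg implements removeOrphan.
table_location = "s3://bucket/db/table"

def is_under(location: str, file_path: str) -> bool:
    """Return True if file_path would show up in a prefix listing of location."""
    prefix = location.rstrip("/") + "/"
    return file_path.startswith(prefix)

# Assumed example of a data file written under the table location.
regular = "s3://bucket/db/table/data/part=a/0000.parquet"
# Data file from the issue, written with write.object-storage.path = 's3://bucket/'.
orphan = "s3://bucket/xxxxxxxx/db/table/data/part=a/0000.parquet"

print(is_under(table_location, regular))  # True
print(is_under(table_location, orphan))   # False
```

   The second file can never be returned by a listing of the table location, 
so it cannot be identified as an orphan candidate.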
   
   The only solution seems to be to set write.object-storage.path to the 
table location manually.  This still reduces throttling (the object store 
prefixes become each table location plus some characters of the hash).
   
   But this is not very user-friendly:
   
   - The path becomes 's3://bucket/db/table/xxxxxxxx/db/table/p=a/0000.parquet'. 
 In particular, 'db/table' is repeated, adding unnecessarily to the path 
length.  It is also a bit awkward in general.
   - From a UX perspective, the user needs to be aware of the table location 
when setting this property (the table location is optional in any create table 
statement and many users do not set it).
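
   The workaround layout above can be sketched as follows. 
`workaround_data_path` is a hypothetical helper, and the hash is passed in as 
the placeholder 'xxxxxxxx' rather than computed, since this only illustrates 
the shape of the path.

```python
# Hypothetical sketch of the path produced when write.object-storage.path
# is manually set to the table location. Not Iceberg code.
def workaround_data_path(table_location, hash_prefix, db, table, partition, filename):
    parts = [table_location.rstrip("/"), hash_prefix, db, table, partition, filename]
    return "/".join(parts)

path = workaround_data_path(
    "s3://bucket/db/table", "xxxxxxxx", "db", "table", "p=a", "0000.parquet"
)
print(path)                    # s3://bucket/db/table/xxxxxxxx/db/table/p=a/0000.parquet
print(path.count("db/table"))  # 2
```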
   
   It would be great to have a new property like 
'write.object-storage.table-location', which, if set to true:
   
   - sets the path to the more reasonable and efficient 
's3://bucket/db/table/xxxxxxxx/p=a/0000.parquet' (removing /data/db/table/ just 
as /data was removed by the original implementation)
   - makes the path automatically follow the default table location, without 
the user having to specify it.  Changing the table location will automatically 
change the write object storage path.
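
   A minimal sketch of the proposed layout, under the assumption that the 
data path is simply table location + hash + partition + file name. 
`proposed_data_path` is a hypothetical helper, the hash is the placeholder 
from the examples above, and 's3://bucket/warehouse/db/table' is an invented 
location used only to show the path following a changed table location.

```python
# Hypothetical sketch of the proposed layout. Not Iceberg code.
def proposed_data_path(table_location, hash_prefix, partition, filename):
    return "/".join([table_location.rstrip("/"), hash_prefix, partition, filename])

current = proposed_data_path("s3://bucket/db/table", "xxxxxxxx", "p=a", "0000.parquet")
print(current)  # s3://bucket/db/table/xxxxxxxx/p=a/0000.parquet

# Assumed new location: the data path follows it with no extra property to update.
moved = proposed_data_path("s3://bucket/warehouse/db/table", "xxxxxxxx", "p=a", "0000.parquet")
print(moved)    # s3://bucket/warehouse/db/table/xxxxxxxx/p=a/0000.parquet

# Both paths sit under their table location, so a listing there would find them.
print(current.startswith("s3://bucket/db/table/"))  # True
```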
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


