adutra commented on issue #3621: URL: https://github.com/apache/polaris/issues/3621#issuecomment-3834628470
After more investigation, I think there is a much better option to improve entropy and reduce hotspots across different tables: just set `write.data.path` for all tables to a common, short prefix, e.g. the warehouse location.

Indeed, Iceberg's Object Store layout behaves differently when the table has `write.data.path` defined. Instead of creating file paths like this:

```
<table-base>/data/<file-hash>/[<partition>/]file1.parquet
```

the layout creates paths like this:

```
<write.data.path>/<file-hash>/<ns>/<table>/[<partition>/]file1.parquet
```

This setup effectively scatters table files and creates high entropy:

```
s3://bucket/warehouse/1111/1111/0100/01010000/ns1/table1/data/file1.parquet
s3://bucket/warehouse/0011/0110/0101/10101100/ns1/table1/data/file2.parquet
s3://bucket/warehouse/1110/0011/0100/11010000/ns2/table2/data/file1.parquet
s3://bucket/warehouse/0111/0100/0101/01101101/ns2/table2/data/file2.parquet
```

The Polaris layout, by contrast, achieves lower entropy since the hash is per-table and not per-file:

```
s3://bucket/warehouse/1111/1111/0100/01010000/ns1/table1/data/file1.parquet
s3://bucket/warehouse/1111/1111/0100/01010000/ns1/table1/data/file2.parquet
s3://bucket/warehouse/1001/0011/1101/11010111/ns2/table2/data/file1.parquet
s3://bucket/warehouse/1001/0011/1101/11010111/ns2/table2/data/file2.parquet
```
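To make the per-file vs. per-table difference concrete, here is a minimal Python sketch that builds paths both ways and counts the distinct hash prefixes each scheme produces. It is only an illustration: `entropy_prefix` uses SHA-256 as a stand-in and is not the hash function Iceberg or Polaris actually use, and the helper names, the `ns.table` hash key, and the generated file names are all hypothetical. Only the 4/4/4/8 binary prefix shape and the path layouts are taken from the examples above.

```python
import hashlib

BASE = "s3://bucket/warehouse"


def entropy_prefix(key: str) -> str:
    """Stand-in hash (assumption): 20 bits as binary digits, split 4/4/4/8."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bits = format(int.from_bytes(digest[:3], "big") >> 4, "020b")
    return "/".join([bits[0:4], bits[4:8], bits[8:12], bits[12:20]])


def per_file_path(ns: str, table: str, filename: str) -> str:
    # Hash derived from the file name, so every file gets its own prefix.
    return f"{BASE}/{entropy_prefix(filename)}/{ns}/{table}/{filename}"


def per_table_path(ns: str, table: str, filename: str) -> str:
    # Hash derived from the table identity, so all files of a table share it.
    return f"{BASE}/{entropy_prefix(ns + '.' + table)}/{ns}/{table}/data/{filename}"


def distinct_prefixes(paths: list[str]) -> int:
    """Count distinct 4-segment hash prefixes that directly follow BASE."""
    prefixes = set()
    for path in paths:
        segments = path[len(BASE) + 1:].split("/")
        prefixes.add("/".join(segments[:4]))
    return len(prefixes)


# Hypothetical workload: two tables with 100 uniquely named files each.
tables = [("ns1", "table1"), ("ns2", "table2")]
files = {t: [f"{t[1]}-{i:05d}.parquet" for i in range(100)] for t in tables}

per_file = [per_file_path(ns, tbl, f) for (ns, tbl) in tables for f in files[(ns, tbl)]]
per_table = [per_table_path(ns, tbl, f) for (ns, tbl) in tables for f in files[(ns, tbl)]]

print("distinct prefixes, per-file hash: ", distinct_prefixes(per_file))
print("distinct prefixes, per-table hash:", distinct_prefixes(per_table))
```

With a per-table hash the number of distinct prefixes can never exceed the number of tables, whereas with a per-file hash it grows with the number of files, which is the entropy difference described above.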
