kevinjqliu commented on issue #1092: URL: https://github.com/apache/iceberg-python/issues/1092#issuecomment-4001768913
Thanks for taking the time to look into this @qzyu999! I think this is on the right track.

Looking at the `rewrite_data_files` implementation in Spark, there are a lot of bells and whistles (probably added over time). For the pyiceberg implementation, it might be useful to scope the feature down as much as possible: just create a harness, and we can improve it over time.

What do you think about first handling the case of compacting a whole table? That way we don't have to deal with `filter` and matching data files. I'm thinking something like `table.maintenance.compact()`, which would rewrite the table using the `REPLACE` operation. For the actual data files, we can take a shortcut and just binpack by reading the table and writing it out again. This should produce the desired file size specified by `write.target-file-size-bytes` (which the write path already uses).

WDYT?
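
To make the binpack idea concrete, here is a minimal, illustrative sketch of the planning step: greedily grouping input data files into output groups that approach a target size. The function name `plan_compaction` and its signature are hypothetical, not existing pyiceberg API; the real implementation would read the grouped files and rewrite them through the normal write path, which already honors `write.target-file-size-bytes`.

```python
def plan_compaction(file_sizes: list[int], target_size: int) -> list[list[int]]:
    """Greedily bin-pack input file sizes (bytes) into output groups.

    Each group's total stays at or under target_size; a single file larger
    than the target gets its own group. This is only a planning sketch,
    not the actual rewrite logic.
    """
    groups: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in file_sizes:
        # Close the current group if adding this file would overflow it.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Example: four small files compacted toward a 60-byte target.
print(plan_compaction([10, 20, 30, 50], 60))  # → [[10, 20, 30], [50]]
```

For the whole-table case this planning step could even be skipped entirely (the "shortcut" above): read the full table into memory and write it back out, letting the write path split files at the configured target size.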
