rdblue commented on pull request #2867: URL: https://github.com/apache/iceberg/pull/2867#issuecomment-891978223
From the description, it sounds like the rewrite happens in the committer task rather than in parallel. Is there a good way to make this happen in parallel instead?

What we discussed elsewhere was doing a compaction by adding a new parallel stage and second committer after the initial committer. The current commit task would output committed `DataFile` instances after the commit succeeds. Then those would be sent to compaction writers using `keyBy` and the partition. Once a compacted data file is large enough, the compaction writer will emit it as a `DataFile` along with the `DataFile` instances that were compacted. Those would be collected by the compaction committer, which would commit a rewrite every checkpoint where there is at least one compacted file.

I think that we should plan on having some parallelism here, or else this is not going to be a very useful feature.
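To make the proposed flow concrete, here is a minimal plain-Java sketch of the two new pieces: a compaction writer that groups committed files by partition until they reach a target size, and a compaction committer that commits only on checkpoints where it has at least one result. It deliberately uses no Flink or Iceberg classes. `CommittedFile`, `RewriteResult`, `CompactionWriter`, `CompactionCommitter`, the target-size threshold, and the rule that files already at the target size are left alone are all hypothetical stand-ins chosen for illustration, and the sketch only tracks sizes rather than rewriting any data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class CompactionSketch {

    /** Hypothetical stand-in for a committed data file; not Iceberg's DataFile. */
    static final class CommittedFile {
        final String path;
        final String partition;
        final long sizeBytes;

        CommittedFile(String path, String partition, long sizeBytes) {
            this.path = path;
            this.partition = partition;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Hypothetical result: one compacted file plus the files it replaces. */
    static final class RewriteResult {
        final CommittedFile compacted;
        final List<CommittedFile> replaced;

        RewriteResult(CommittedFile compacted, List<CommittedFile> replaced) {
            this.compacted = compacted;
            this.replaced = replaced;
        }
    }

    /** One compaction writer. With keying by partition, each partition goes to one writer. */
    static final class CompactionWriter {
        private final long targetSizeBytes;
        private final Map<String, List<CommittedFile>> pending = new HashMap<>();
        private int counter = 0;

        CompactionWriter(long targetSizeBytes) {
            this.targetSizeBytes = targetSizeBytes;
        }

        /** Buffers the file; emits a result once its partition's buffer reaches the target. */
        Optional<RewriteResult> add(CommittedFile file) {
            // Assumption: a file that is already large enough is left alone.
            if (file.sizeBytes >= targetSizeBytes) {
                return Optional.empty();
            }
            List<CommittedFile> files =
                pending.computeIfAbsent(file.partition, k -> new ArrayList<>());
            files.add(file);
            long total = 0;
            for (CommittedFile f : files) {
                total += f.sizeBytes;
            }
            if (total < targetSizeBytes) {
                return Optional.empty();
            }
            pending.remove(file.partition);
            // The sketch only sums sizes; a real writer would rewrite the rows.
            CommittedFile compacted = new CommittedFile(
                "compacted-" + (counter++), file.partition, total);
            return Optional.of(new RewriteResult(compacted, files));
        }
    }

    /** Compaction committer: commits a rewrite only on checkpoints with results. */
    static final class CompactionCommitter {
        private final List<RewriteResult> buffered = new ArrayList<>();
        int commits = 0;

        void collect(RewriteResult result) {
            buffered.add(result);
        }

        /** Returns true if this checkpoint produced a rewrite commit. */
        boolean onCheckpoint() {
            if (buffered.isEmpty()) {
                return false;
            }
            commits++; // a real committer would commit the file replacement here
            buffered.clear();
            return true;
        }
    }

    public static void main(String[] args) {
        CompactionWriter writer = new CompactionWriter(100L);
        CompactionCommitter committer = new CompactionCommitter();
        long[] sizes = {40L, 50L, 30L};
        for (int i = 0; i < sizes.length; i++) {
            writer.add(new CommittedFile("f" + i, "day=1", sizes[i]))
                  .ifPresent(committer::collect);
            System.out.println("checkpoint " + i + " committed: " + committer.onCheckpoint());
        }
    }
}
```

In a real job there would be many such writers running in parallel, one per key group, which is the parallelism asked for above; the committer stays single-instance so that each checkpoint yields at most one rewrite commit.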
