rdblue commented on pull request #2867:
URL: https://github.com/apache/iceberg/pull/2867#issuecomment-891978223


   From the description, it sounds like the rewrite happens in the committer 
task rather than in parallel. Is there a good way to make this happen in 
parallel instead?
   
   What we discussed elsewhere was doing compaction by adding a new parallel 
stage and a second committer after the initial committer. The current commit 
task would output committed `DataFile` instances after the commit succeeds. 
Those would then be sent to compaction writers using `keyBy` on the partition. 
Once a compacted data file is large enough, the compaction writer would emit it 
as a `DataFile` along with the `DataFile` instances that were compacted into 
it. Those would be collected by the compaction committer, which would commit a 
rewrite at every checkpoint where there is at least one compacted file.
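
   To make the flow above concrete, here is a rough plain-Java sketch of the 
two new pieces, with no Flink or Iceberg dependencies. Everything in it is 
hypothetical and only for illustration: `FileInfo` stands in for `DataFile`, 
and `CompactionWriter`, `CompactionCommitter`, `Rewrite`, and the size 
threshold are made-up names, not existing APIs. In a real job, `keyBy` on the 
partition would route files to parallel writer instances; here a single writer 
keeps a per-partition map instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CompactionSketch {

  // Hypothetical stand-in for Iceberg's DataFile; only the fields this sketch needs.
  static final class FileInfo {
    final String path;
    final String partition;
    final long sizeBytes;

    FileInfo(String path, String partition, long sizeBytes) {
      this.path = path;
      this.partition = partition;
      this.sizeBytes = sizeBytes;
    }
  }

  // One rewrite: the compacted file plus the committed files it replaces.
  static final class Rewrite {
    final FileInfo compacted;
    final List<FileInfo> replaced;

    Rewrite(FileInfo compacted, List<FileInfo> replaced) {
      this.compacted = compacted;
      this.replaced = replaced;
    }
  }

  // Models a compaction writer (assumption: it sees all files for its partitions).
  static final class CompactionWriter {
    private final long targetSizeBytes;
    private final Map<String, List<FileInfo>> pending = new HashMap<>();
    private int nextId = 0;

    CompactionWriter(long targetSizeBytes) {
      this.targetSizeBytes = targetSizeBytes;
    }

    // Returns a Rewrite once the partition's pending files reach the target size, else null.
    Rewrite add(FileInfo file) {
      List<FileInfo> files = pending.computeIfAbsent(file.partition, p -> new ArrayList<>());
      files.add(file);
      long total = 0;
      for (FileInfo f : files) {
        total += f.sizeBytes;
      }
      if (total < targetSizeBytes) {
        return null;
      }
      pending.remove(file.partition);
      FileInfo compacted = new FileInfo("compacted-" + (nextId++), file.partition, total);
      return new Rewrite(compacted, files);
    }
  }

  // Models the compaction committer: buffers rewrites and commits at checkpoints.
  static final class CompactionCommitter {
    private final List<Rewrite> buffered = new ArrayList<>();
    private int commitCount = 0;

    void collect(Rewrite rewrite) {
      buffered.add(rewrite);
    }

    // Called once per checkpoint; commits only if at least one file was compacted.
    boolean onCheckpoint() {
      if (buffered.isEmpty()) {
        return false;
      }
      commitCount++; // a real job would commit the rewrite to the table here
      buffered.clear();
      return true;
    }

    int commitCount() {
      return commitCount;
    }
  }

  public static void main(String[] args) {
    CompactionWriter writer = new CompactionWriter(100);
    CompactionCommitter committer = new CompactionCommitter();
    FileInfo[] committed = {
      new FileInfo("a", "p=1", 60), new FileInfo("b", "p=2", 60), new FileInfo("c", "p=1", 50)
    };
    for (FileInfo file : committed) {
      Rewrite rewrite = writer.add(file);
      if (rewrite != null) {
        committer.collect(rewrite);
      }
    }
    System.out.println("first checkpoint committed: " + committer.onCheckpoint());
    System.out.println("second checkpoint committed: " + committer.onCheckpoint());
  }
}
```

   The point of the split is that the size accounting and the rewriting live in 
the writers, which can run in parallel per partition, while the committer only 
buffers results and decides whether a checkpoint has anything to commit.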
   
   I think that we should plan on having some parallelism here, or else this is 
not going to be a very useful feature.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


