stevenzwu commented on pull request #2867: URL: https://github.com/apache/iceberg/pull/2867#issuecomment-919272970
> In this PR the rewriteAction of flink is parallel, it will not make data deal slow down.

@hameizi by parallel, we meant multiple executors/tasks executing the rewrite. Last time I checked, this PR runs the whole rewrite action synchronously in the single committer task. That is the main scalability concern we have.

Also, `notifyCheckpointComplete` (and `snapshotState`) executes in the mailbox thread. If `notifyCheckpointComplete`, including the rewrite, takes a long time to finish, it can delay checkpoint execution.

I share Jack's philosophy of keeping the streaming ingestion simple and stable. It is critical to **reliably** ingest data into long-term data storage (like Iceberg) first, as streaming input (like Kafka) typically has short retention.

> [Handle the case that RewriteFiles and RowDelta commit the transaction at the same time #2308](https://github.com/apache/iceberg/issues/2308)

Regarding this issue, I agree that running commit and compaction in lock step can avoid the problem. But it is not a solution to the general problem, because other users probably run compaction as separate jobs, for example in Spark. There are also more sophisticated compaction/rewrite actions that probably can't be supported at scale by a single-task rewrite action.
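
As a rough illustration of the mailbox-thread concern above, here is a minimal sketch in plain Python (not Flink code). It models the mailbox as a single queue whose tasks run one at a time on a virtual clock. The function name `mailbox_start_times` and all durations are hypothetical, chosen only to show how a long callback pushes back the next checkpoint's `snapshotState`.

```python
from collections import deque


def mailbox_start_times(tasks):
    """Run (name, duration_ms) tasks strictly one after another,
    the way a single mailbox thread would, and return the virtual
    time in milliseconds at which each task starts."""
    clock_ms = 0
    starts = {}
    queue = deque(tasks)
    while queue:
        name, duration_ms = queue.popleft()
        starts[name] = clock_ms
        clock_ms += duration_ms  # nothing else can run until this task ends
    return starts


# Hypothetical durations: a 100 ms commit and a 60 s rewrite.
COMMIT_MS = 100
REWRITE_MS = 60_000

# Case 1: commit and rewrite both run inside the checkpoint-complete callback.
inline = mailbox_start_times([
    ("notifyCheckpointComplete", COMMIT_MS + REWRITE_MS),
    ("snapshotState", 50),
])

# Case 2: the callback only commits; the rewrite happens elsewhere.
commit_only = mailbox_start_times([
    ("notifyCheckpointComplete", COMMIT_MS),
    ("snapshotState", 50),
])

inline_delay_ms = inline["snapshotState"]
commit_only_delay_ms = commit_only["snapshotState"]
print(inline_delay_ms, commit_only_delay_ms)
```

In this toy model the next `snapshotState` cannot start until the callback returns, so the rewrite duration is added directly to the wait before the next checkpoint can proceed. Real Flink scheduling is more involved; the sketch only captures the serialization on one thread.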
