stevenzwu commented on pull request #2867:
URL: https://github.com/apache/iceberg/pull/2867#issuecomment-919272970


   > In this PR the rewriteAction of flink is parallel, it will not make data 
deal slow down.
   
   @hameizi by parallel, we meant multiple executors/tasks executing the 
rewrite. Last time I checked, this PR runs the whole rewrite action 
synchronously in the single committer task. That is our main scalability 
concern.
   
   Also, notifyCheckpointComplete (and snapshotState) execute in the mailbox 
thread. If the rewrite inside notifyCheckpointComplete takes a long time to 
finish, it blocks that thread and can delay checkpoint execution.
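   
   To make the blocking concern concrete, here is a minimal sketch. It does 
not use Flink or Iceberg APIs: the mailbox thread is modeled as a plain JDK 
single-thread executor, the rewrite is a `Thread.sleep`, and all names 
(`MailboxBlockingSketch`, `slowRewrite`, `checkpointDelayMillis`) are 
hypothetical. It only shows that work queued behind a long callback has to 
wait when everything shares one thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class MailboxBlockingSketch {

    // Stand-in for a long-running rewrite/compaction (assumption: just sleeps).
    static void slowRewrite(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns how long (in ms) the next queued task waited before it could start.
    static long checkpointDelayMillis(boolean offloadRewrite, long rewriteMillis) throws Exception {
        // Single thread that plays the role of the task's mailbox thread.
        ExecutorService mailbox = Executors.newSingleThreadExecutor();
        // Separate thread, used only when the rewrite is handed off.
        ExecutorService rewritePool = Executors.newSingleThreadExecutor();
        try {
            long enqueuedAt = System.nanoTime();

            // Plays the role of the notifyCheckpointComplete callback.
            Runnable onCheckpointComplete = () -> {
                if (offloadRewrite) {
                    rewritePool.submit(() -> slowRewrite(rewriteMillis));
                } else {
                    slowRewrite(rewriteMillis); // blocks the mailbox thread
                }
            };
            mailbox.submit(onCheckpointComplete);

            // Plays the role of the next checkpoint action queued on the same thread.
            Callable<Long> nextCheckpoint = System::nanoTime;
            Future<Long> startedAt = mailbox.submit(nextCheckpoint);

            return TimeUnit.NANOSECONDS.toMillis(startedAt.get() - enqueuedAt);
        } finally {
            mailbox.shutdown();
            rewritePool.shutdown();
            rewritePool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("inline rewrite, next action delayed by ms: "
                + checkpointDelayMillis(false, 300));
        System.out.println("offloaded rewrite, next action delayed by ms: "
                + checkpointDelayMillis(true, 300));
    }
}
```

   With the inline variant the next action waits roughly as long as the 
rewrite takes; with the offloaded variant it starts almost immediately. A real 
operator would also need to handle failures and state access from the other 
thread, which this sketch ignores.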
   
   I share the same philosophy as Jack on keeping the streaming ingestion simple 
and stable. It is critical to **reliably** ingest data into long-term data 
storage (like Iceberg) first, as streaming input (like Kafka) typically has 
short retention.
   
   > [Handle the case that RewriteFiles and RowDelta commit the transaction at 
the same time #2308](https://github.com/apache/iceberg/issues/2308)
   
   Regarding this issue, I agree that running commit and compaction in 
lockstep can avoid the problem. But it is not a solution for the general 
problem, because other users probably run compaction from separate jobs, such 
as Spark. There are also more sophisticated compaction/rewrite actions that 
probably can't be supported by a single-task rewrite action at scale.
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


