jackye1995 commented on pull request #2867:
URL: https://github.com/apache/iceberg/pull/2867#issuecomment-918632602


   I have also been following this thread, although I did not make any comments. 
Let me add some thoughts since I see you are making some new changes.
   
   I am mostly on the same line of thought as @stevenzwu. I am a bit worried 
about the scalability of the current implementation. I think the parallel 
commit proposal from @rdblue could work, but in the end, running compaction 
in a streaming pipeline is likely an unnecessary complication. 
   
   So far we have been advocating for streaming pipelines to just commit new 
files to storage, and to use a separate process to handle compaction at the 
same time. Having the streaming pipeline also do compaction would mean that 
there might be 2 compaction processes competing with each other. This becomes 
especially complicated and error-prone when you have both batch jobs and 
streaming pipelines running at the same time (e.g. normal streaming + daily 
loading of corrected and late data). I understand it is likely a good 
optimization for simple use cases, but if we open it for general usage, I would 
expect it to be a feature that requires a lot of in-depth knowledge to use 
safely and correctly.
   
   I wonder what the initial motivation behind this implementation is. Do you 
just want to avoid needing a separate Spark cluster to run compaction? If we had 
Flink actions specifically for `RewriteDataFiles` and `RewriteDeleteFiles` that 
you could schedule on the same Flink cluster, would that solve the issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


