stevenzwu commented on issue #6514:
URL: https://github.com/apache/iceberg/issues/6514#issuecomment-1444388008

   Just to add to @rdblue 's good points above. 
   
   Regarding the watermark use cases, we can prefix the watermark (in snapshot 
metadata) from each writer job to avoid the conflict. The downside is that 
consumer from the watermark info need to aggregate and take the min value. if 
we go with the conditional commit approach and fail the second commit with 
lower watermark, how would the second application handle the failure? we can 
also get into the situation where the second application may never able to 
commit if it's watermark is forever behind the first application. The condition 
check will always be false.
   
   Regarding the Kafka data file committer use case, it is non-desirable to 
have all the parallel threads committing to the Iceberg table. If the 
parallelism is 100 or 1,000, there will be a lot of collisions and retries. The 
conditional commit can ensure the correctness. But it can be inefficient or 
infeasible with high parallelism. Flink Iceberg sink coalescing all data files 
to a single committer task so that there is only one committer thread (in a 
Flink job) committing to the iceberg table.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to