fqaiser94 commented on issue #6514:
URL: https://github.com/apache/iceberg/issues/6514#issuecomment-1451250830

   Hi @stevenzwu, 
   
   Great job presenting at subsurface conference today :) 
   Regarding your comments: 
   
   > Regarding the Kafka data file committer use case, it is non-desirable to 
have all the parallel threads committing to the Iceberg table. If the 
parallelism is 100 or 1,000, there will be a lot of collisions and retries. The 
conditional commit can ensure the correctness. But it can be inefficient or 
infeasible with high parallelism. Flink Iceberg sink coalescing all data files 
to a single committer task so that there is only one committer thread (in a 
Flink job) committing to the iceberg table.
   
   I fully agree with you on all the points you've stated here. 
   If the parallelism is high, conditional commits will be extremely 
inefficient to the point of infeasible even though it guarantees "correctness."
   In general, it is best to have just a single thread committing updates to an 
iceberg table, which is what Flink Iceberg Sink does.
   This is also exactly what we try to do in our custom system. 
   Our issue however is that we cannot guarantee that there will **always** be 
exactly a single thread committing a given datafile to the iceberg table. 
   In **_rare_** situations, and for only a **_brief_** moment in time, there 
may be more than one thread attempting to commit **_the same set of 
datafiles_** to the same iceberg table **_at the same time_**. 
   Due to the distributed nature of our datafile-committer application, it is 
impossible to completely avoid this type of situation ever happening, we can 
only make them rare. 
   This is where we want to use conditional commits as a last line of defence 
to ensure that only one of these threads is successful in committing the 
datafile during these rare and brief moments of instability. 
   
   I hope this explanation clarifies why conditional commits is still useful 
for datafile-committer use-cases like ours where you can guarantee that the 
majority of the time there will not be any conflicts. 
   
   > if we go with the conditional commit approach and fail the second commit 
with lower watermark, how would the second application handle the failure? we 
can also get into the situation where the second application may never able to 
commit. if it's watermark is forever behind the first application, the 
condition check will always be false. not sure if this is a desirable behavior.
   
   The desired behaviour here will likely vary depending on the use-case. 
   For example, for our use-case, the second application which is hit with a 
failure would abandon trying to commit that set of datafiles and move on to the 
next set of datafiles. 
   Due to the nature of our application, we're guaranteed to eventually (and 
quickly) converge to a stable condition where each instance will be left 
handling messages that cannot conflict. 
   
   As far as the iceberg library is concerned, IMO it should just provide the 
fundamental building block i.e. a way for iceberg-api users to express the 
conditions under which a commit should proceed, and leave how to handle 
failures up to the API users. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to