[GitHub] [arrow-datafusion] JanKaul commented on issue #6339: Simplified TableProvider::Insert API

via GitHub Fri, 12 May 2023 08:38:50 -0700


JanKaul commented on issue #6339:
URL: 
https://github.com/apache/arrow-datafusion/issues/6339#issuecomment-1545937647


   Thanks for your explanation. There is just one small detail that would be 
great for Iceberg/Deltalake.
   
   Apache Iceberg and Deltalake use MVCC to guarantee atomic transactions on 
tables. Therefore they optimistically write the data of a transaction to some 
kind of storage. Once the data is written, the metadata of the table is updated 
if no other process has updated the metadata in the meantime. 
   With the approach that you are suggesting, the atomicity of a write 
transaction can only be guaranteed on a partition basis. This could lead to the 
scenario where one partition is written successfully, then another process 
updates the metadata and writing the following partitions could fail. Moreover, 
one would need to be really careful that the different asynchronous tasks don't 
invalidate the write operation from the other tasks even in the same logical 
write transaction.
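
   A minimal sketch of the scenario above, assuming a hypothetical in-memory `Catalog` that stands in for the table metadata. None of these names are Iceberg, Deltalake or DataFusion APIs. `commit` only succeeds if the metadata is still at the version the writer started from, so committing once after all partitions are written is all-or-nothing, while committing per partition can leave a partial write behind.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for a table's metadata: (snapshot version, committed files).
#[derive(Debug, Default)]
pub struct Catalog {
    state: Mutex<(u64, Vec<String>)>,
}

#[derive(Debug, PartialEq)]
pub enum CommitError {
    /// Another writer updated the metadata after `expected` was read.
    Conflict { expected: u64, found: u64 },
}

impl Catalog {
    /// Version of the metadata a writer bases its transaction on.
    pub fn current_version(&self) -> u64 {
        self.state.lock().unwrap().0
    }

    /// Files that are visible to readers.
    pub fn committed_files(&self) -> Vec<String> {
        self.state.lock().unwrap().1.clone()
    }

    /// Publish `files` only if the metadata is still at `base_version`
    /// (the optimistic check); returns the new version on success.
    pub fn commit(&self, base_version: u64, files: Vec<String>) -> Result<u64, CommitError> {
        let mut state = self.state.lock().unwrap();
        if state.0 != base_version {
            return Err(CommitError::Conflict {
                expected: base_version,
                found: state.0,
            });
        }
        state.1.extend(files);
        state.0 += 1;
        Ok(state.0)
    }
}

/// Write every partition first (simulated here by naming a file per partition),
/// then update the metadata once, so the whole insert is a single transaction.
pub fn write_all_then_commit(catalog: &Catalog, partitions: &[&str]) -> Result<u64, CommitError> {
    let base = catalog.current_version();
    let files: Vec<String> = partitions.iter().map(|p| format!("{p}.parquet")).collect();
    catalog.commit(base, files)
}
```

   If instead each partition calls `commit` on its own, a concurrent writer can slip in between two partitions: the first partition is already visible, and the second one fails with `CommitError::Conflict`.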
   
   The current proposal for the `DataSink` trait definitely simplifies the 
implementation of insert operations for Iceberg and Deltalake tables. It just 
doesn't allow the implementation of the full functionality for Iceberg and 
Deltalake. I just wanted to mention that.
   
   Forget my comment on the asynchronous method, I somehow missed the BoxFuture.


