JanKaul commented on issue #6339: URL: https://github.com/apache/arrow-datafusion/issues/6339#issuecomment-1545937647
Thanks for your explanation. There is just one small detail that would be great for Iceberg/Deltalake.

Apache Iceberg and Deltalake use multi-version concurrency control (MVCC) to guarantee atomic transactions on tables. They optimistically write the data of a transaction to storage first. Once the data is written, the table metadata is updated, but only if no other process has updated the metadata in the meantime.

With the approach you are suggesting, the atomicity of a write transaction can only be guaranteed per partition. This could lead to a scenario where one partition is written successfully, then another process updates the metadata, and writing the remaining partitions fails. Moreover, one would need to be very careful that the different asynchronous tasks don't invalidate each other's writes, even within the same logical write transaction.

The current proposal for the `DataSink` trait definitely simplifies the implementation of insert operations for Iceberg and Deltalake tables. It just doesn't allow the full functionality of Iceberg and Deltalake to be implemented. I just wanted to mention that.

Forget my comment on the asynchronous method, I somehow missed the `BoxFuture`.
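As a rough illustration of the optimistic commit described above, here is a minimal in-memory sketch. It is not the Iceberg, Deltalake, or DataFusion API: `TableMetadata`, `Catalog`, `try_commit`, and `commit_transaction` are hypothetical names, and a `Mutex` stands in for whatever atomic metadata swap a real catalog provides.

```rust
use std::sync::Mutex;

/// Hypothetical table metadata: a version counter plus the committed data files.
#[derive(Debug, Default)]
struct TableMetadata {
    version: u64,
    files: Vec<String>,
}

/// Hypothetical catalog that accepts a commit only if it is based on the
/// current metadata version (a compare-and-swap on the version).
#[derive(Debug, Default)]
struct Catalog {
    metadata: Mutex<TableMetadata>,
}

impl Catalog {
    /// Version a writer reads at the start of its transaction.
    fn current_version(&self) -> u64 {
        self.metadata.lock().unwrap().version
    }

    /// Snapshot of the files that are currently visible to readers.
    fn committed_files(&self) -> Vec<String> {
        self.metadata.lock().unwrap().files.clone()
    }

    /// Publish already-written files. Fails if another commit has advanced
    /// the version since `base_version` was read.
    fn try_commit(&self, base_version: u64, new_files: &[String]) -> Result<u64, String> {
        let mut meta = self.metadata.lock().unwrap();
        if meta.version != base_version {
            return Err(format!(
                "conflict: based on version {}, table is at version {}",
                base_version, meta.version
            ));
        }
        meta.files.extend_from_slice(new_files);
        meta.version += 1;
        Ok(meta.version)
    }
}

/// Collect the files written for every partition and publish them in a
/// single commit, so the whole write becomes visible or none of it does.
fn commit_transaction(
    catalog: &Catalog,
    base_version: u64,
    partitions: &[Vec<String>],
) -> Result<u64, String> {
    let all_files: Vec<String> = partitions.iter().flatten().cloned().collect();
    catalog.try_commit(base_version, &all_files)
}
```

In this sketch, if each partition task calls `try_commit` on its own with the version read at the start of the transaction, the first task succeeds and bumps the version, so the second task is rejected and the table is left with only part of the write. Calling `commit_transaction` once, after all partitions have been written, avoids that partial state.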
