tustvold commented on issue #5130:
URL: https://github.com/apache/arrow-datafusion/issues/5130#issuecomment-1449968315

   > To support mutability the TableProvider implementation would need to 
implement "interior mutability"
   
   I think this touches on a key point: there needs to be some sort of
consistency/atomicity story here. Most users would likely expect `INSERT
INTO` to be atomic, i.e. a query sees either all of the inserted data or none
of it. They may additionally have expectations with respect to
transaction isolation / serializability.
   
   **Blindly appending to a CSV / JSON file without any external coordination 
will result in queries seeing partial or potentially corrupted data** 
   
   One common approach is to always write new data to a new file. Queries only
see the file once it is complete, which ensures atomicity.
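
   As a rough illustration of the new-file approach, here is a minimal sketch
using only the Rust standard library. It assumes a local filesystem where
`rename` within a directory is atomic, and it assumes readers only scan files
ending in `.csv`. The helper names (`insert_as_new_file`, `visible_files`) and
the `part-{id}.csv` naming scheme are hypothetical and not part of DataFusion.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: writes `rows` as one brand-new file in `table_dir`.
/// The data is first written under a `.tmp` name that readers ignore,
/// then published with a single rename.
pub fn insert_as_new_file(
    table_dir: &Path,
    batch_id: u64,
    rows: &[String],
) -> io::Result<PathBuf> {
    let tmp_path = table_dir.join(format!("part-{batch_id}.csv.tmp"));
    let final_path = table_dir.join(format!("part-{batch_id}.csv"));

    let mut file = File::create(&tmp_path)?;
    for row in rows {
        writeln!(file, "{row}")?;
    }
    // Make sure the contents are on disk before the file becomes visible.
    file.sync_all()?;

    // Assumption: same-filesystem rename is atomic, so a reader sees
    // either the complete file or no file at all.
    fs::rename(&tmp_path, &final_path)?;
    Ok(final_path)
}

/// Hypothetical reader-side listing: only completed `.csv` files count.
pub fn visible_files(table_dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(table_dir)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) == Some("csv") {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}
```

   Each `INSERT INTO` would call `insert_as_new_file` with a fresh `batch_id`.
This gives per-insert atomicity only; it says nothing about isolation between
concurrent writers or about a query that lists the directory midway through
several inserts.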
   
   This basic approach can then be optionally extended with things like:
   
   * A Write-Ahead Log and MemTable to reduce file churn
   * Catalog functionality, such as provided by deltalake or lakehouse, to
support in-place, atomic rewrites, transactions, etc.
   * Compaction functionality (deltalake calls this 
[bin-packing](https://docs.delta.io/1.2.1/optimizations-oss.html#compaction-bin-packing))
 to coalesce small files into larger ones
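
   To make the compaction idea concrete, here is a naive sketch that
concatenates several small files into one larger file. It assumes header-less,
newline-terminated CSV files on a local filesystem. The `compact` function and
its parameters are hypothetical. The final step is deliberately not atomic,
which is the gap a catalog would close.

```rust
use std::fs::{self, File};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: merges `small_files` into a single new file
/// named `out_name` inside `table_dir`, then removes the inputs.
pub fn compact(
    table_dir: &Path,
    small_files: &[PathBuf],
    out_name: &str,
) -> io::Result<PathBuf> {
    let tmp_path = table_dir.join(format!("{out_name}.tmp"));
    let final_path = table_dir.join(out_name);

    // Write the merged data under a temporary name first.
    let mut out = File::create(&tmp_path)?;
    for path in small_files {
        let mut input = File::open(path)?;
        io::copy(&mut input, &mut out)?;
    }
    out.sync_all()?;

    // Publish the merged file in one step.
    fs::rename(&tmp_path, &final_path)?;

    // Caveat: between the rename above and these removals, a reader that
    // lists the directory would see the same rows twice. A catalog could
    // swap the file set in one commit; this sketch cannot.
    for path in small_files {
        fs::remove_file(path)?;
    }
    Ok(final_path)
}
```

   The window where both the merged file and its inputs are visible shows why
in-place rewrites need some external coordination on top of the plain new-file
approach.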
   
   I think adding some pieces of this functionality to DataFusion would be
amazing, and it may even be of interest to the delta-rs folks (FYI @roeap), but
it may benefit from having a more fleshed-out catalog story first (#5291).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

