alamb commented on code in PR #9648:
URL: https://github.com/apache/arrow-datafusion/pull/9648#discussion_r1530533085
##########
datafusion/core/src/datasource/file_format/write/mod.rs:
##########
@@ -150,19 +74,13 @@ pub trait BatchSerializer: Sync + Send {
fn serialize(&self, batch: RecordBatch, initial: bool) -> Result<Bytes>;
}
-/// Returns an [`AbortableWrite`] which writes to the given object store location
+/// Returns an [`AsyncWrite`] which writes to the given object store location
/// with the specified compression
pub(crate) async fn create_writer(
file_compression_type: FileCompressionType,
location: &Path,
object_store: Arc<dyn ObjectStore>,
-) -> Result<AbortableWrite<Box<dyn AsyncWrite + Send + Unpin>>> {
- let (multipart_id, writer) = object_store
Review Comment:
For anyone else following along, the `BufWriter` internally does a
multi-part put when appropriate
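As a concrete illustration (my own sketch, not code from this PR; the file path and payload are made up), here is roughly how the new write path looks with `object_store::buffered::BufWriter`:
```rust
use std::sync::Arc;

use object_store::{buffered::BufWriter, path::Path, ObjectStore};
use tokio::io::AsyncWriteExt;

// Hypothetical helper: write a small CSV payload to the given store.
async fn write_example(store: Arc<dyn ObjectStore>) -> std::io::Result<()> {
    let location = Path::from("data/output.csv");
    // BufWriter buffers writes in memory and transparently switches to a
    // multi-part upload once enough data has accumulated.
    let mut writer = BufWriter::new(store, location);
    writer.write_all(b"a,b,c\n1,2,3\n").await?;
    // shutdown() flushes and completes the upload (finishing the
    // multi-part put if one was started).
    writer.shutdown().await?;
    Ok(())
}
```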
##########
datafusion/core/src/datasource/file_format/write/mod.rs:
##########
@@ -69,79 +66,6 @@ impl Write for SharedBuffer {
}
}
-/// Stores data needed during abortion of MultiPart writers
Review Comment:
Is it correct to say that the implication of removing `AbortableWrite` is that if certain (larger) writes to the object store fail / abort for some reason, "garbage" (unreferenced partial uploads) may be left around indefinitely on the provider?
While I understand that some object stores (maybe all) can be configured to automatically clean up such parts, I think removing the "try to clean up on failure" behavior is worth reconsidering.
I think I could be convinced by an argument like "the software can't guarantee cleanup anyway (for example, if it is `SIGKILL`ed), so we shouldn't even try to clean up in the paths where we could", but if we go that route I think we should explicitly document the behavior and rationale in comments somewhere.
I think @metesynnada or @mustafasrepo originally added this code (though I may be wrong), so perhaps they have some perspective to share.
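For context, here is a rough sketch of the cleanup pattern being removed, based on the older `object_store` multi-part API (`put_multipart` / `abort_multipart`); it is a simplification of what `AbortableWrite` did, and the function name is mine:
```rust
use std::sync::Arc;

use object_store::{path::Path, ObjectStore};
use tokio::io::AsyncWriteExt;

// Hypothetical helper illustrating the pattern: write via a multi-part
// upload, and abort the upload on failure so no orphaned parts remain.
async fn multipart_put_with_cleanup(
    store: Arc<dyn ObjectStore>,
    location: &Path,
    data: &[u8],
) -> object_store::Result<()> {
    // Old object_store API: returns the upload id plus an AsyncWrite.
    let (multipart_id, mut writer) = store.put_multipart(location).await?;

    let write_result = async {
        writer.write_all(data).await?;
        writer.shutdown().await
    }
    .await;

    if let Err(e) = write_result {
        // Best-effort cleanup: ask the store to discard the parts uploaded
        // so far. This is essentially what `AbortableWrite` did on abort.
        let _ = store.abort_multipart(location, &multipart_id).await;
        return Err(object_store::Error::Generic {
            store: "multipart_put_with_cleanup",
            source: Box::new(e),
        });
    }
    Ok(())
}
```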
##########
datafusion/core/src/datasource/file_format/file_compression_type.rs:
##########
@@ -152,7 +153,7 @@ impl FileCompressionType {
/// according to this `FileCompressionType`.
pub fn convert_async_writer(
&self,
- w: Box<dyn AsyncWrite + Send + Unpin>,
+ w: BufWriter,
Review Comment:
I think it depends on whether this code is always used with `object_store` (i.e., whether DataFusion always writes output through the `object_store` API). If that is the case, then switching to `BufWriter` here makes sense to me.
BTW, I think we also need to update the comments on this function to match the new implementation.
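For example, I'd expect the updated function to look roughly like the sketch below (encoder types are from the `async_compression` crate; the local `CompressionTypeVariant` enum and the exact set of variants are illustrative, and the real function returns a `Result`):
```rust
use async_compression::tokio::write::{BzEncoder, GzipEncoder, XzEncoder, ZstdEncoder};
use object_store::buffered::BufWriter;
use tokio::io::AsyncWrite;

// Illustrative stand-in for DataFusion's compression variant enum.
#[allow(non_camel_case_types)]
enum CompressionTypeVariant {
    GZIP,
    BZIP2,
    XZ,
    ZSTD,
    UNCOMPRESSED,
}

/// Wrap the `BufWriter` in the compression encoder matching `variant`,
/// returning a boxed `AsyncWrite` either way.
fn convert_async_writer(
    variant: &CompressionTypeVariant,
    w: BufWriter,
) -> Box<dyn AsyncWrite + Send + Unpin> {
    match variant {
        CompressionTypeVariant::GZIP => Box::new(GzipEncoder::new(w)),
        CompressionTypeVariant::BZIP2 => Box::new(BzEncoder::new(w)),
        CompressionTypeVariant::XZ => Box::new(XzEncoder::new(w)),
        CompressionTypeVariant::ZSTD => Box::new(ZstdEncoder::new(w)),
        CompressionTypeVariant::UNCOMPRESSED => Box::new(w),
    }
}
```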
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]