devinjdangelo commented on PR #7992:
URL: 
https://github.com/apache/arrow-datafusion/pull/7992#issuecomment-1786188264

   > Fair enough, I think the justification would be if you know that the file 
is going to be less than the 10MB minimum part size, a multipart upload is not 
going to do anything but add request and buffer copying overheads. That being 
said it might still help for the local filesystem use-case...
   
   Yeah agreed. My thinking is that if you are writing a few <10MB files, performance is almost certainly not an issue. And if you are writing dozens or hundreds of <=10MB files, you probably just shouldn't do that; instead, write a few files in the 256MB-2GB range with multipart put.
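   The trade-off above can be sketched as a simple heuristic: fall back to a plain put when the whole file fits under the minimum part size and no more data is coming, otherwise use multipart. All names below are hypothetical for illustration, not the actual `object_store` / DataFusion API:

   ```rust
   // Minimum part size imposed by S3-style stores; below this a multipart
   // upload only adds request and buffer-copying overhead.
   const MIN_PART_SIZE: usize = 10 * 1024 * 1024;

   #[derive(Debug, PartialEq)]
   enum UploadStrategy {
       SinglePut, // one request, no multipart bookkeeping
       Multipart, // stream parts of >= MIN_PART_SIZE each, in parallel
   }

   /// Hypothetical decision helper: pick a strategy from how many bytes
   /// are buffered and whether the writer expects more data.
   fn choose_strategy(buffered: usize, more_data_expected: bool) -> UploadStrategy {
       if buffered < MIN_PART_SIZE && !more_data_expected {
           UploadStrategy::SinglePut
       } else {
           UploadStrategy::Multipart
       }
   }

   fn main() {
       // A finished 1 MB file: multipart would only add overhead.
       assert_eq!(choose_strategy(1 << 20, false), UploadStrategy::SinglePut);
       // A large streaming write: multipart lets parts upload concurrently.
       assert_eq!(choose_strategy(256 << 20, true), UploadStrategy::Multipart);
       println!("ok");
   }
   ```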
   
   > Just from a cursory perusal of the code, a lot of the design complexity 
appears to arise from the append use-case, including things like whether to 
write headers or not. This also causes pain on the read side. I wonder if we 
could better encapsulate this streaming use-case, potentially in its own 
operators, so as to simplify the general design
   
   Yes, agreed. There are currently a few hacky workarounds that allow the streaming and appending code to coexist with some of the parallelization code (such as [here](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/write/orchestration.rs#L79)).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
