devinjdangelo commented on PR #7992: URL: https://github.com/apache/arrow-datafusion/pull/7992#issuecomment-1786188264
> Fair enough, I think the justification would be if you know that the file is going to be less than the 10MB minimum part size, a multipart upload is not going to do anything but add request and buffer copying overheads. That being said it might still help for the local filesystem use-case...

Yeah, agreed. My thinking is that if you are writing a few <10MB files, performance is almost certainly not an issue. If you are writing dozens or hundreds of <=10MB files, you probably shouldn't do that at all, and should instead write a few files in the 256MB-2GB range with multipart put (see the sketch at the end of this comment).

> Just from a cursory perusal of the code, a lot of the design complexity appears to arise from the append use-case, including things like whether to write headers or not. This also causes pain on the read side. I wonder if we could better encapsulate this streaming use-case, potentially in its own operators, so as to simplify the general design

Yes, agreed. There are currently a few hacky workarounds that allow the streaming and appending code to coexist with some of the parallelization code (such as [here](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/write/orchestration.rs#L79)).
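To make the first point concrete, here is a minimal sketch of the put-vs-multipart decision, assuming the `object_store` 0.7-era API (`put` taking `Bytes`, `put_multipart` returning an `AsyncWrite`); `write_object` and `MIN_PART_SIZE` are hypothetical names for illustration, not anything in this PR:

```rust
use bytes::Bytes;
use object_store::{path::Path, ObjectStore};
use tokio::io::AsyncWriteExt;

// Hypothetical threshold matching the 10MB minimum part size discussed above.
const MIN_PART_SIZE: usize = 10 * 1024 * 1024;

// Hypothetical helper: below the minimum part size, a multipart upload only
// adds request and buffer-copying overhead, so issue a single PUT instead;
// fall back to a multipart upload for larger payloads.
async fn write_object(
    store: &dyn ObjectStore,
    location: &Path,
    data: Bytes,
) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    if data.len() < MIN_PART_SIZE {
        // One request, no extra buffering.
        store.put(location, data).await?;
    } else {
        // Multipart upload: data can be buffered and uploaded in parts.
        let (_id, mut writer) = store.put_multipart(location).await?;
        writer.write_all(&data).await?;
        writer.shutdown().await?; // finalize the upload
    }
    Ok(())
}
```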
