devinjdangelo commented on PR #7992: URL: https://github.com/apache/arrow-datafusion/pull/7992#issuecomment-1786188264
> Fair enough, I think the justification would be if you know that the file is going to be less than the 10MB minimum part size, a multipart upload is not going to do anything but add request and buffer copying overheads. That being said it might still help for the local filesystem use-case...

Yeah, agreed. My thinking is that if you are writing a few <10MB files, performance is almost certainly not an issue. If you are writing dozens or hundreds of <=10MB files, you probably shouldn't do that at all, and should instead write a few files in the 256MB-2GB range with multipart put (see the sketch at the end of this comment).

> Just from a cursory perusal of the code, a lot of the design complexity appears to arise from the append use-case, including things like whether to write headers or not. This also causes pain on the read side. I wonder if we could better encapsulate this streaming use-case, potentially in its own operators, so as to simplify the general design

Yes, agreed. There are currently a few hacky workarounds that allow the streaming and appending code to coexist with some of the parallelization code (such as [here](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/write/orchestration.rs#L79)).
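To make the first point concrete, here is a minimal sketch of the put-vs-multipart decision, assuming the `object_store` 0.7-era API (`put` taking `Bytes`, `put_multipart` returning an `AsyncWrite`); `write_object` and `MIN_PART_SIZE` are hypothetical names for illustration, not anything in this PR:

```rust
use bytes::Bytes;
use object_store::{path::Path, ObjectStore};
use tokio::io::AsyncWriteExt;

// Hypothetical threshold matching the 10MB minimum part size discussed above.
const MIN_PART_SIZE: usize = 10 * 1024 * 1024;

// Hypothetical helper: below the minimum part size, a multipart upload only
// adds request and buffer-copying overhead, so issue a single PUT instead;
// fall back to a multipart upload for larger payloads.
async fn write_object(
    store: &dyn ObjectStore,
    location: &Path,
    data: Bytes,
) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    if data.len() < MIN_PART_SIZE {
        // One request, no extra buffering.
        store.put(location, data).await?;
    } else {
        // Multipart upload: data can be buffered and uploaded in parts.
        let (_id, mut writer) = store.put_multipart(location).await?;
        writer.write_all(&data).await?;
        writer.shutdown().await?; // finalize the upload
    }
    Ok(())
}
```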
