Tom-Newton opened a new issue, #40036: URL: https://github.com/apache/arrow/issues/40036
### Describe the enhancement requested

Optimisation to https://github.com/apache/arrow/issues/38333
Child of https://github.com/apache/arrow/issues/18014

Currently `ObjectAppendStream::DoAppend` calls `block_blob_client_->StageBlock` synchronously, meaning that each call to `ObjectAppendStream::DoAppend` blocks until the data has been successfully written to blob storage. This is very inefficient for large numbers of small writes.

This performance problem is obvious even in small tests against Azurite. The [`UploadLines`](https://github.com/apache/arrow/blob/d9891918a42c74002533ed5c06c42b7d0d070820/cpp/src/arrow/filesystem/azurefs_test.cc#L529-L535) function used to create test data uses `std::accumulate` and writes the data in one call for performance reasons.

With accumulate:
```
[ RUN ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt
[ OK ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt (1350 ms)
```

Without accumulate (4096 separate calls to `ObjectAppendStream::DoAppend`):
```
[ RUN ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt
[ OK ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt (25124 ms)
```

And this is when testing against `azurite` on localhost, so against real blob storage, where the latency is going to be much higher, the problem will be exacerbated.

By comparison, the GCS filesystem is able to handle the latter approach without performance issues. https://github.com/apache/arrow/blob/d9891918a42c74002533ed5c06c42b7d0d070820/cpp/src/arrow/filesystem/gcsfs_test.cc#L1282-L1286

Some options to optimise:
1. Call `block_blob_client_->StageBlock` asynchronously and await all the futures in `ObjectAppendStream::Flush`.
2. Buffer small writes in memory and make fewer, larger calls to `block_blob_client_->StageBlock`.
3. Buffer small writes in memory and make batched calls to `block_blob_client_->StageBlock`.

### Component(s)

C++
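As a rough illustration of option 2 (buffering small writes in memory), here is a minimal sketch. It does not use the Azure SDK or any Arrow class: `BufferedAppender`, `StageBlockFn`, and the `threshold` parameter are hypothetical names introduced only for this example, and the callback stands in for a synchronous upload such as `StageBlock`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for a synchronous upload such as StageBlock: it
// receives one block of bytes and returns only once the block is stored.
using StageBlockFn = std::function<void(std::string_view)>;

// Hypothetical helper: accumulates small appends in memory and forwards them
// to `stage_block` only once at least `threshold` bytes are pending, or when
// Flush() is called explicitly.
class BufferedAppender {
 public:
  BufferedAppender(StageBlockFn stage_block, std::size_t threshold)
      : stage_block_(std::move(stage_block)), threshold_(threshold) {
    assert(threshold_ > 0);
    buffer_.reserve(threshold_);
  }

  // In the common case this only copies into the buffer, with no upload call.
  void Append(std::string_view data) {
    buffer_.append(data);
    if (buffer_.size() >= threshold_) {
      Flush();
    }
  }

  // Uploads whatever is pending as a single block; a no-op if nothing is pending.
  void Flush() {
    if (buffer_.empty()) return;
    stage_block_(buffer_);
    buffer_.clear();
  }

 private:
  StageBlockFn stage_block_;
  std::size_t threshold_;
  std::string buffer_;
};
```

With this shape, many small `Append` calls result in one upload per `threshold` bytes (plus one for the final `Flush`) rather than one upload per call. A real implementation would still need to decide on a suitable threshold and handle upload errors, which this sketch leaves out.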
