Tom-Newton opened a new issue, #40036:
URL: https://github.com/apache/arrow/issues/40036

   ### Describe the enhancement requested
   
   Optimisation to https://github.com/apache/arrow/issues/38333
   Child of https://github.com/apache/arrow/issues/18014
   
   Currently `ObjectAppendStream::DoAppend` calls 
`block_blob_client_->StageBlock` synchronously, so each call to 
`ObjectAppendStream::DoAppend` blocks until the data has been successfully 
written to blob storage. This is very inefficient for large numbers of small 
writes. 
   
   This performance problem shows up even in small tests against azurite. The 
[`UploadLines`](https://github.com/apache/arrow/blob/d9891918a42c74002533ed5c06c42b7d0d070820/cpp/src/arrow/filesystem/azurefs_test.cc#L529-L535)
 function used to create test data uses `std::accumulate` and writes the data 
in one call for performance reasons.  
   
   With accumulate:
   ```
   [ RUN      ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt
   [       OK ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt (1350 ms)
   ```
   
   Without accumulate (4096 separate calls to `ObjectAppendStream::DoAppend`):
   ```
   [ RUN      ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt
   [       OK ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt (25124 ms)
   ```
   This is when testing against `azurite` on localhost. Against real blob 
storage, where the latency is going to be much higher, the problem will be 
exacerbated.  
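   
   As a rough back-of-the-envelope check, the two timings above can be turned 
into a per-call overhead. This is only a sketch: the constant and function 
names are made up for illustration, and it assumes the whole difference between 
the two runs comes from the extra synchronous calls.
   
   ```cpp
   // Timings quoted above, in milliseconds.
   constexpr double kAccumulateMs = 1350.0;   // one write built with std::accumulate
   constexpr double kSeparateMs = 25124.0;    // 4096 separate DoAppend calls
   constexpr int kNumCalls = 4096;

   // Extra wall-clock time per additional synchronous call, assuming the
   // difference between the two runs is entirely due to those calls.
   constexpr double PerCallOverheadMs() {
     return (kSeparateMs - kAccumulateMs) / (kNumCalls - 1);
   }

   // Rough total time for `num_calls` synchronous calls at a given per-call
   // latency. `per_call_ms` is a hypothetical input, not a measured value.
   constexpr double EstimatedTotalMs(double per_call_ms, int num_calls) {
     return per_call_ms * num_calls;
   }
   ```
   
   `PerCallOverheadMs()` comes out a little under 6 ms per call against 
azurite on localhost. Passing a measured round-trip time for a real storage 
account to `EstimatedTotalMs` gives a rough idea of how the cost scales.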
   
   By comparison, the GCS filesystem is able to handle the latter approach 
without performance issues. 
https://github.com/apache/arrow/blob/d9891918a42c74002533ed5c06c42b7d0d070820/cpp/src/arrow/filesystem/gcsfs_test.cc#L1282-L1286
   
   Some options to optimise:
   1. Call `block_blob_client_->StageBlock` asynchronously and await all the 
futures in `ObjectAppendStream::Flush`.
   2. Buffer small writes in memory and make fewer, larger calls to 
`block_blob_client_->StageBlock`.
   3. Buffer small writes in memory and make batched calls to 
`block_blob_client_->StageBlock`.
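   
   A minimal sketch of how options 1 and 2 could be combined is below. It is 
not the actual `ObjectAppendStream` code: `BufferedAppender`, `StageBlockFn`, 
and the `block_size` parameter are hypothetical names, and the callback only 
stands in for `block_blob_client_->StageBlock`. Block IDs are simplified and 
errors are reported via exceptions rather than `arrow::Status`.
   
   ```cpp
   #include <cstddef>
   #include <functional>
   #include <future>
   #include <string>
   #include <utility>
   #include <vector>

   // Hypothetical stand-in for block_blob_client_->StageBlock(block_id, data).
   // It is invoked from background threads, so it must be thread-safe.
   using StageBlockFn =
       std::function<void(const std::string& block_id, const std::string& data)>;

   class BufferedAppender {
    public:
     BufferedAppender(StageBlockFn stage_block, std::size_t block_size)
         : stage_block_(std::move(stage_block)), block_size_(block_size) {}

     // Option 2: accumulate small writes and only stage a block once at
     // least `block_size` bytes have been buffered.
     void Append(const std::string& data) {
       buffer_ += data;
       if (buffer_.size() >= block_size_) StageBufferedData();
     }

     // Option 1: stage whatever is left, then wait for every in-flight
     // upload. future::get() rethrows any exception from the upload.
     void Flush() {
       if (!buffer_.empty()) StageBufferedData();
       for (auto& f : pending_) f.get();
       pending_.clear();
     }

     std::size_t num_blocks_staged() const { return next_block_; }

    private:
     void StageBufferedData() {
       // Simplified block ID; a real implementation needs IDs in the
       // format the storage service expects.
       std::string block_id = std::to_string(next_block_++);
       // Move the buffer into the task so the upload owns its own data.
       pending_.push_back(std::async(
           std::launch::async,
           [fn = stage_block_, id = std::move(block_id),
            data = std::move(buffer_)] { fn(id, data); }));
       buffer_.clear();  // leave the moved-from buffer in a known empty state
     }

     StageBlockFn stage_block_;
     std::size_t block_size_;
     std::string buffer_;
     std::vector<std::future<void>> pending_;
     std::size_t next_block_ = 0;
   };
   ```
   
   With this shape, `Append` returns without waiting on the network and many 
small writes collapse into a handful of staged blocks. The trade-off is that 
upload errors only surface in `Flush`, and memory use grows with the number of 
blocks in flight, so a real implementation would likely want to cap that.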
   
   ### Component(s)
   
   C++


