wjones127 commented on PR #35808:
URL: https://github.com/apache/arrow/pull/35808#issuecomment-1574637403

   > I suspect this shouldn't affect performance, but have you tried a large upload against a remote endpoint?
   
   Just tested on an EC2 instance in the same region as an S3 bucket. They seem 
comparable.
   
   Before:
   ```
   Upload 200MB data in 25 pieces (8.0 MB per piece).
   Avg: 2.4996164005998254 Min: 1.839938310999969 Max: 6.8072016670002995
   
   Upload 200MB data in 4 pieces (50.0 MB per piece).
   Avg: 2.3683339380998403 Min: 2.1181160359992646 Max: 3.710967674999665
   ```
   
   After:
   
   ```
   Upload 200MB data in 25 pieces (8.0 MB per piece).
   Avg: 1.989240413999869 Min: 1.9223971040009928 Max: 2.2479570469986356
   
   Upload 200MB data in 4 pieces (50.0 MB per piece).
   Avg: 2.1779588877003335 Min: 2.1119931640005234 Max: 2.3420757010007947
   ```
   <details>
   <summary>Benchmark script</summary>
   
   ```python
   import pyarrow.fs as pa_fs
   import random
   import os
   import time
   
    # 200 MB across 25 pieces
    pieces = 25
    piece_size = 200 * 1024 * 1024 // pieces
   
   bucket_name = os.environ["OBJECT_STORE_BUCKET"]
   
   fs = pa_fs.S3FileSystem(
       access_key=os.environ["AWS_ACCESS_KEY_ID"],
       secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
       region="us-west-2",
       background_writes=True, # Also test false
   )
   
   iterations = 10
   times = []
   
   for _ in range(iterations):
       start = time.monotonic()
       with fs.open_output_stream(f"{bucket_name}/my_test_file.bin") as f:
           for i in range(pieces):
               data = random.randbytes(piece_size)
               f.write(data)
       end = time.monotonic()
       times.append(end - start)
   
   info = fs.get_file_info(f"{bucket_name}/my_test_file.bin")
   assert info.size == piece_size * pieces
   
   avg_time = sum(times) / len(times)
    print(f"Upload 200MB data in {pieces} pieces ({piece_size / 1024 / 1024} MB per piece).")
   print(f"Avg: {avg_time} Min: {min(times)} Max: {max(times)}")
   ```
   </details>
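
   The script above hardcodes `pieces = 25`, while the results cover both a 25-piece and a 4-piece run. A minimal sketch of how both configurations could be derived for a second run (`piece_configs` is a hypothetical helper, not part of the script):

   ```python
   def piece_configs(total_bytes=200 * 1024 * 1024, piece_counts=(25, 4)):
       """Yield (pieces, piece_size) pairs for each benchmark configuration."""
       for pieces in piece_counts:
           yield pieces, total_bytes // pieces

   # Each pair feeds the upload loop above in place of the hardcoded values.
   for pieces, piece_size in piece_configs():
       print(f"Upload 200MB data in {pieces} pieces ({piece_size / 1024 / 1024} MB per piece).")
   ```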

