westonpace commented on issue #34892:
URL: https://github.com/apache/arrow/issues/34892#issuecomment-1538550562

   > Imagine a scenario where you have nearly continuous influx of data, which you need to render into parquet and store on S3. A backoff strategy works fine and well for a single write, but when you have loads of data incoming, if you get rate limited, and you backoff, you risk falling behind to a point where it's very difficult to catch up.
   
   > This is, of course, hypothetical, but it illustrates that whilst throttling and retry with backoff would be very useful for 90% of use cases (and I would certainly appreciate them, I just do not possess the programming skill to implement them here :( ), there are some niche circumstances where we may need to consider batching writes more efficiently.
   
   The dataset writer itself issues one "Write" call per row group.  You can batch those up using the `min_rows_per_group` option of the write call.
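
   If it helps, here is a minimal sketch of what that looks like from Python with `pyarrow.dataset.write_dataset` (the bucket name, region, and row-group sizes below are placeholders, not values taken from this issue):

```python
# Minimal sketch: larger row groups mean fewer "Write" calls, and therefore
# fewer part uploads against S3.  All names and values here are illustrative.
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")          # assumed region

table = pa.table({"id": list(range(100_000)),
                  "value": [0.0] * 100_000})

ds.write_dataset(
    table,
    "my-bucket/dataset",           # hypothetical bucket/prefix
    format="parquet",
    filesystem=s3,
    min_rows_per_group=100_000,    # accumulate at least this many rows per Write
    max_rows_per_group=1_000_000,
)
```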
   
   The S3 filesystem itself will batch incoming writes until it has accumulated 5 MB of data.  This is controlled by a constant `kMinimumPartUpload`.  Given that S3 supposedly allows 5,500 requests per second, that would seem to imply a limit of 27.5 GB/s, which I assume is more than enough.
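
   For a rough sense of where that figure comes from (the 5,500 requests/second number is the one quoted above; real S3 limits apply per prefix and per request type):

```python
# Back-of-envelope arithmetic: minimum part size times request rate bounds the
# write throughput you could sustain before S3 throttles individual requests.
min_part_mb = 5              # kMinimumPartUpload is 5 MB
requests_per_second = 5_500  # figure quoted above
print(min_part_mb * requests_per_second / 1_000, "GB/s")  # -> 27.5 GB/s
```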
   
   It's also possible, if you have many partitions and a low limit on open files, that many small parquet files are being created.  So you might check whether that is happening (and increase the `max_open_files` limit if it is).
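
   As a hedged sketch, raising that limit from Python looks roughly like this (the partition columns, limit, and bucket are made up for illustration; `max_open_files` defaults to 1024 in recent pyarrow releases):

```python
# Sketch: a higher max_open_files keeps more partition files open at once, so the
# writer does not close files early and leave many small parquet files behind.
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")          # assumed region

# Hypothetical partitioned table: two partition columns plus a payload column.
table = pa.table({"year": [2023] * 50_000 + [2024] * 50_000,
                  "month": ([1] * 25_000 + [2] * 25_000) * 2,
                  "value": [0.0] * 100_000})

ds.write_dataset(
    table,
    "my-bucket/dataset",             # hypothetical bucket/prefix
    format="parquet",
    filesystem=s3,
    partitioning=["year", "month"],
    max_open_files=4096,             # raise from the default if files are cycled
)
```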
   
   Again, I think more investigation is needed.  How many writes per second are 
actually being issued?  
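
   One way to check that from Python, assuming a recent pyarrow where `pyarrow.fs.initialize_s3` and `S3LogLevel` are available, is to turn up the AWS SDK log level before the S3 filesystem is first used and count the upload requests in the resulting log output:

```python
# Sketch: verbose S3 logging makes each request the AWS SDK issues visible, so
# you can estimate writes per second during a dataset write.  This must run
# before the first S3FileSystem is created.
from pyarrow import fs

fs.initialize_s3(fs.S3LogLevel.Debug)      # log every request the SDK sends
s3 = fs.S3FileSystem(region="us-east-1")   # assumed region
# ... perform the dataset write, then count PutObject/UploadPart entries in the
# SDK log to see how many write requests per second are actually issued ...
```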

