westonpace commented on issue #34118:
URL: https://github.com/apache/arrow/issues/34118#issuecomment-1426262379

   > When calling DoInitializeS3, arrow initialises the AWS API, which 
by default creates a thread pool for the background AWS event loop that uses 
one thread per physical core on the system.
   
   I thought the default behavior was for AWS to [not use a pool at all and 
spin up a brand new detached thread 
per-request](https://aws.amazon.com/blogs/developer/using-a-thread-pool-with-the-aws-sdk-for-c/)
 but that article is pretty old so maybe it is no longer the behavior.
   
   Furthermore, the 
[docs](https://awslabs.github.io/aws-crt-cpp/class_aws_1_1_crt_1_1_io_1_1_event_loop_group.html)
 state "which will create one for each processor on the machine."  Perhaps it 
is a typo on their part but unless you have a multi-CPU machine (e.g. NUMA) I 
would expect this to use a single thread (and it would be weird if their 
default went against their recommendations).  Although, looking at the linked 
issue, it does indeed seem to be a lot of threads.  And...after further 
debugging...it does seem to be one thread per physical core on my system.
   
   > This is rather unfriendly when running a multi-process or otherwise 
parallelised process on a multicore box since it leads to oversubscription.
   
   I wouldn't be terribly worried about this.  I expect these threads will 
spend the majority of their time in a blocked state, not scheduled by the OS.  I 
agree there is some minor cost to having more threads than you need, but it 
isn't the more significant hit you get by over-scheduling CPU threads, which 
leads to an excess of context switches.
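   
   To illustrate the blocked-thread point, here is a rough Python sketch. It 
is not Arrow or AWS SDK code, and the function name `measure_blocked_threads` 
is made up for this example. It parks a number of threads on an event and 
compares the wall-clock time with the CPU time the process consumed while 
they waited. I would expect the CPU time to be a small fraction of the wall 
time, but the exact numbers depend on the machine.

```python
import threading
import time


def measure_blocked_threads(num_threads, seconds):
    """Park num_threads on an Event; return (wall_seconds, cpu_seconds)."""
    release = threading.Event()
    # Each worker blocks inside wait() until the main thread sets the event.
    workers = [threading.Thread(target=release.wait) for _ in range(num_threads)]

    wall_start = time.perf_counter()
    # process_time() counts user + system CPU time for the whole process,
    # so it includes the worker threads.
    cpu_start = time.process_time()

    for worker in workers:
        worker.start()
    time.sleep(seconds)  # the main thread is blocked here as well
    release.set()
    for worker in workers:
        worker.join()

    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = measure_blocked_threads(num_threads=64, seconds=0.5)
    print(f"wall: {wall:.3f}s  cpu: {cpu:.3f}s")
```

   This only models idle threads.  It says nothing about how busy the AWS 
event loop threads are under real S3 traffic, which is what benchmarking 
would have to show.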
   
   > It would be nice if there were a way to control the size of this thread 
pool
   
   Agreed, there is already `arrow::fs::S3GlobalOptions` so we have some 
precedent.  I don't know if there are Python bindings for it, and it seems we 
would need to add an "event loop thread pool count" option to the mix.
   
   > I think the following diff is kind of a sketch in this direction, although 
it just unilaterally sets the size of the thread pool available to a single 
thread.
   
   It sounds like it would be a good idea in general to change the default to 1 
anyways.  Though this could use some benchmarking.
   
   > Aside: AFAICT there's no programmatic way of controlling arrow's thread pool 
size, it must be done via environment variables, which is also rather unfriendly
   
   Do you want to open a separate issue for this?  Seems like a reasonable 
request. 

