westonpace commented on issue #34118: URL: https://github.com/apache/arrow/issues/34118#issuecomment-1426262379
> When calling DoInitializeS3, arrow initialises the AWS API, which by default creates a thread pool for the background AWS event loop that uses one thread per physical core on the system.

I thought the default behavior was for AWS to [not use a pool at all and spin up a brand new detached thread per-request](https://aws.amazon.com/blogs/developer/using-a-thread-pool-with-the-aws-sdk-for-c/), but that article is pretty old, so maybe that is no longer the behavior. Furthermore, the [docs](https://awslabs.github.io/aws-crt-cpp/class_aws_1_1_crt_1_1_io_1_1_event_loop_group.html) state "which will create one for each processor on the machine." Perhaps it is a typo on their part, but unless you have a multi-CPU machine (e.g. NUMA) I would expect this to use a single thread (and it would be weird if their default went against their recommendations). Although, looking at the linked issue, it does indeed seem to be a lot of threads. And...after further debugging...it does seem to be one thread per physical core on my system.

> This is rather unfriendly when running a multi-process or otherwise parallelised process on a multicore box since it leads to oversubscription.

I wouldn't be terribly worried about this. I expect these threads will spend the majority of their time in a blocked state, not scheduled by the OS. I agree there is some minor hit from having more threads than you need, but it isn't the more significant hit you get from over-scheduling CPU-bound threads, which leads to an excess of context switches.

> It would be nice if there were a way to control the size of this thread pool

Agreed. There is already `arrow::fs::S3GlobalOptions`, so we have some precedent. I don't know if there are Python bindings for it, and it seems we would need to add an "event loop thread pool count" to the mix.

> I think the following diff is kind of a sketch in this direction, although it just unilaterally sets the size of the thread pool available to a single thread.

It sounds like it would be a good idea in general to change the default to 1 anyway, though this could use some benchmarking.

> Aside: AFAICT there's no programmatic way of controlling arrow's thread pool size, it must be done via environment variables, which is also rather unfriendly

Do you want to open a separate issue for this? Seems like a reasonable request.
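
On the oversubscription point, here is a minimal sketch of why I'd expect blocked threads to be cheap. It is plain Python using only the standard library, not Arrow or AWS SDK code, and the function names (`cpu_seconds_while_idle`, `cpu_seconds_while_spinning`) are made up for illustration. It compares the process CPU time used while many threads sit blocked against the CPU time used by a single busy loop over the same wall-clock interval. It says nothing about memory or thread-creation overhead, so it is not a substitute for the benchmarking mentioned above.

```python
import threading
import time


def cpu_seconds_while_idle(num_threads: int, wall_seconds: float = 0.5) -> float:
    """Return process CPU time used while `num_threads` threads sit blocked."""
    stop = threading.Event()
    # Each worker blocks on the event, roughly like an event-loop thread
    # that is waiting for I/O and has nothing to do.
    workers = [threading.Thread(target=stop.wait) for _ in range(num_threads)]
    for worker in workers:
        worker.start()

    cpu_before = time.process_time()
    time.sleep(wall_seconds)  # the main thread is blocked here as well
    cpu_used = time.process_time() - cpu_before

    # Release and clean up the workers.
    stop.set()
    for worker in workers:
        worker.join()
    return cpu_used


def cpu_seconds_while_spinning(wall_seconds: float = 0.5) -> float:
    """Return process CPU time used by one busy loop, for comparison."""
    cpu_before = time.process_time()
    deadline = time.monotonic() + wall_seconds
    while time.monotonic() < deadline:
        pass  # burn CPU on purpose
    return time.process_time() - cpu_before


if __name__ == "__main__":
    print(f"64 blocked threads: {cpu_seconds_while_idle(64):.3f}s CPU")
    print(f"1 spinning thread:  {cpu_seconds_while_spinning():.3f}s CPU")
```

On a typical machine I would expect the blocked-thread figure to be close to zero and the spinning figure to be close to the wall-clock interval, which is the distinction between idle event-loop threads and over-scheduled CPU threads.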
