pitrou commented on a change in pull request #12670: URL: https://github.com/apache/arrow/pull/12670#discussion_r833072957
##########
File path: docs/source/cpp/threading.rst
##########
@@ -0,0 +1,100 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+.. _cpp_thread_management:
+
+=================
+Thread Management
+=================
+
+.. seealso::
+   :doc:`Thread management API reference <api/thread>`
+
+Thread Pools
+=======================
+
+Many Arrow operations distribute work across multiple threads to take
+advantage of underlying hardware parallelism. For example, when reading a
+parquet file we can decode each column in parallel. To achieve this we
+submit tasks to an executor of some kind.
+
+Within Arrow we use thread pools for parallel scheduling and an event loop
+when the user has requested serial execution. It is possible for
+users to provide their own custom implementation, though that is an advanced
+concept and not covered here.
+
+CPU vs. I/O
+-----------
+
+In order to minimize the overhead of context switches our default thread pool
+for CPU-intensive tasks has a fixed size, defaulting to
+`std::thread::hardware_concurrency <https://en.cppreference.com/w/cpp/thread/thread/hardware_concurrency>`_.
+This means that CPU tasks should never block for long periods of time because this
+will result in under-utilization of the CPU. To achieve this we have a separate
+thread pool which should be used for tasks that need to block. Since these tasks
+are usually associated with I/O operations we call this the I/O thread pool. This
+model is often associated with asynchronous computation.
+
+The size of the I/O thread pool currently defaults to 8 threads and should
+be sized according to the parallel capabilities of the I/O hardware. For example,
+if most reads and writes occur on a typical HDD then the default of 8 will probably
+be sufficient. On the other hand, when most reads and writes occur on a remote
+filesystem such as S3, it is often possible to benefit from many concurrent reads
+and it may be possible to increase I/O performance by increasing the size of the
+I/O thread pool. The size of the default I/O thread pool can be managed with
+the :ref:`ARROW_IO_THREADS<env_arrow_io_threads>` environment variable or

Review comment:
Can simply use the ```:envvar:`ARROW_IO_THREADS` ``` markup (just like for classes, functions, etc.).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
