[ https://issues.apache.org/jira/browse/ARROW-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452704#comment-17452704 ]
Weston Pace commented on ARROW-14974:
-------------------------------------
Note: The inverse also happens often. We will sometimes do some compute work on
the I/O thread pool to avoid a potential loss of cache locality or the creation
of too many small thread tasks. That direction is more tolerable, though.
> [C++] Dataset scanning, in async mode, is running parquet reads on the CPU
> thread pool
> --------------------------------------------------------------------------------------
>
> Key: ARROW-14974
> URL: https://issues.apache.org/jira/browse/ARROW-14974
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> This is something I picked up while doing some profiling a while back. When
> running a scan of a large Parquet dataset, many of the read tasks (i.e.
> blocking I/O) were running on the CPU thread pool. This could leave the CPU
> thread pool underutilized, since its threads sit blocked on I/O instead of
> doing compute work.
> It might not have a large effect on the parquet read itself (if the reads are
> slow we are probably I/O bound so one might not notice) but it can cause
> issues on a more complex query where reading is being interleaved with CPU
> work (like filtering and joining).
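
The separation described in the issue can be sketched in a few lines. This is only an illustration using Python's standard library, not Arrow's scanner code: `io_pool`, `cpu_pool`, `read_chunk`, `decode_chunk` and `scan` are hypothetical names, and the sleep stands in for a blocking Parquet read. The point is that blocking reads are submitted only to the I/O pool, so the small CPU pool stays free for compute.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pools standing in for an I/O thread pool and a CPU thread pool.
# The I/O pool is larger because its threads spend most of their time blocked.
io_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="io")
cpu_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="cpu")


def read_chunk(i):
    """Simulated blocking read; the sleep stands in for disk or network latency."""
    time.sleep(0.01)
    return i, threading.current_thread().name


def decode_chunk(read_result):
    """Simulated CPU work (decoding, filtering) on data that was already read."""
    i, read_thread = read_result
    value = sum(range(1000)) + i  # placeholder compute
    return {
        "chunk": i,
        "read_on": read_thread,
        "decoded_on": threading.current_thread().name,
        "value": value,
    }


def scan(num_chunks):
    # Blocking reads go to the I/O pool only.
    read_futures = [io_pool.submit(read_chunk, i) for i in range(num_chunks)]
    # Each finished read is handed to the CPU pool, which never blocks on I/O.
    decode_futures = [cpu_pool.submit(decode_chunk, f.result()) for f in read_futures]
    return [f.result() for f in decode_futures]


results = scan(6)
for r in results:
    print(r["chunk"], r["read_on"], r["decoded_on"])
```

If `read_chunk` were submitted to `cpu_pool` instead, its two workers would spend most of their time sleeping, which is the underutilization the issue describes.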