[
https://issues.apache.org/jira/browse/ARROW-14354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429852#comment-17429852
]
Weston Pace commented on ARROW-14354:
-------------------------------------
Yes, this stems from my initial stab at TPC-H profiling. I was running a basic
scan->filter->project query through the exec plan, and for both Parquet and IPC
formats the bottleneck was RAM (on my system at least). I was using cached
files, so no actual I/O needed to be done.
Right now, the Parquet format is doing its reads on the CPU thread pool (a
potentially separate problem), so as you adjust the size of the CPU thread pool
the wall clock time doesn't change (past 4 threads, where the RAM bottleneck is
hit) but the process time does. For example, running with 16 threads, both Intel
and perf report an average core utilization of ~14.5. Running with 4 threads I
get an average core utilization of ~4, and the wall clock time is the same for
both. With IPC things are a little different because the IPC file format is
correctly using the I/O executor, so regardless of what I set the CPU thread
pool count to (as long as it is above 4) the core utilization is ~8 and the
wall clock time is the same.
I do think we can probably set this as a filesystem property. I don't really
have enough experience with different disks (e.g. supposedly some SSDs have
decent support for parallel reads), but we probably don't need very many
threads for a local filesystem.
On that note: For datasets, the IOContext (and thus the IO executor) is
currently passed in via scan options. Should this be obtained from the
filesystem instead?
> [C++] Investigate reducing I/O thread pool size to avoid CPU wastage.
> ---------------------------------------------------------------------
>
> Key: ARROW-14354
> URL: https://issues.apache.org/jira/browse/ARROW-14354
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> If we are reading over HTTP (e.g. S3) we generally want high parallelism in
> the I/O thread pool.
> If we are reading from disk then high parallelism is usually harmless but
> ineffective. Most of the I/O threads will spend their time in a waiting
> state and the cores can be used for other work.
> However, it appears that when we are reading locally, and the data is cached
> in memory, then having too much parallelism will be harmful, but some
> parallelism is beneficial. Once the DRAM <-> CPU bandwidth limit is hit then
> all reading threads will experience high DRAM latency. Unlike an I/O
> bottleneck, a RAM bottleneck will waste cycles on the physical core.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)