[ https://issues.apache.org/jira/browse/ARROW-14354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429852#comment-17429852 ]

Weston Pace commented on ARROW-14354:
-------------------------------------

Yes, this stems from my initial stab at TPC-H profiling.  I was running a basic 
scan->filter->project query through the exec plan, and for both the Parquet and 
IPC formats the bottleneck was RAM (on my system, at least).  I was using cached 
files, so no actual I/O needed to be done.

Right now, the Parquet format is doing its reads on the CPU thread pool (a 
potentially separate problem), so as you adjust the size of the CPU thread pool 
the wall clock time doesn't change (past 4 threads, where the RAM bottleneck is 
hit) but the process time does.  For example, running with 16 threads, both Intel 
and perf report an average core utilization of ~14.5.  Running with 4 threads I 
get an average core utilization of ~4, and the wall clock time is the same for 
both.  With IPC things are a little different because the IPC file format is 
correctly using the I/O executor, so regardless of what I set the CPU thread 
pool count to (as long as it is above 4) the core utilization is ~8 and the 
wall clock time is the same.

I do think we could probably make this a filesystem property.  I don't really 
have enough experience with different disks (e.g. some SSDs supposedly have 
decent support for parallel reads), but we probably don't need very many threads 
for a local filesystem.

On that note: for datasets, the IOContext (and thus the I/O executor) is 
currently passed in via the scan options.  Should it be obtained from the 
filesystem instead?



> [C++] Investigate reducing I/O thread pool size to avoid CPU wastage.
> ---------------------------------------------------------------------
>
>                 Key: ARROW-14354
>                 URL: https://issues.apache.org/jira/browse/ARROW-14354
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> If we are reading over HTTP (e.g. S3) we generally want high parallelism in 
> the I/O thread pool.
> If we are reading from disk then high parallelism is usually harmless but 
> ineffective.  Most of the I/O threads will spend their time in a waiting 
> state and the cores can be used for other work.
> However, it appears that when we are reading locally, and the data is cached 
> in memory, then having too much parallelism will be harmful, but some 
> parallelism is beneficial.  Once the DRAM <-> CPU bandwidth limit is hit then 
> all reading threads will experience high DRAM latency.  Unlike an I/O 
> bottleneck a RAM bottleneck will waste cycles on the physical core.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
