Weston Pace created ARROW-11583:
-----------------------------------
Summary: [C++] Filesystem aware disk scheduling
Key: ARROW-11583
URL: https://issues.apache.org/jira/browse/ARROW-11583
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
Different filesystems have different ideal access strategies. For example:
AWS: Unlimited parallelism? No penalty for random access?
AWS EBS: Depends
SSD: Bounded parallelism (# of hw contexts), penalty for random access within
a context.
HDD: Very limited parallelism (usually 1), penalty for random access
Currently, Arrow does not factor these access strategies into I/O scheduling.
For example, when reading a dataset of 100 files, it starts reading X files
at once (where X is the parallelism of the thread pool). For AWS this is
ideal. For an HDD it is not.
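As a rough sketch of what "factoring in" the access strategy could mean, the
number of files read at once could be capped per storage type instead of
always using the thread pool size. Everything below is hypothetical and not
an existing Arrow API: the StorageKind enum, the MaxConcurrentReads helper
and the hw_contexts parameter are assumptions made for illustration, and the
limits simply mirror the (partly unconfirmed) guesses in the list above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical storage classes, mirroring the list in this issue.
enum class StorageKind { kObjectStore, kSsd, kHdd };

// Hypothetical helper: how many files to read concurrently.
//   pool_threads: parallelism of the thread pool (X in the text above)
//   hw_contexts:  assumed number of hardware contexts on an SSD
std::size_t MaxConcurrentReads(StorageKind kind, std::size_t pool_threads,
                               std::size_t hw_contexts) {
  assert(pool_threads > 0);
  switch (kind) {
    case StorageKind::kObjectStore:
      // Assumed: no benefit from throttling below the pool size.
      return pool_threads;
    case StorageKind::kSsd:
      // Bounded by the hardware contexts, but always at least one read.
      return std::max<std::size_t>(1, std::min(pool_threads, hw_contexts));
    case StorageKind::kHdd:
      // A single stream avoids seeking back and forth between files.
      return 1;
  }
  return 1;  // Conservative fallback for unknown kinds.
}
```

With a pool of 8 threads this would read 8 files at once from an object
store, 4 from an SSD with 4 contexts, and 1 from an HDD. EBS is left out on
purpose since its behavior "depends".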
The OS does have a scheduler which attempts to mitigate this. However, it
does not know the scope of the I/O or the dependencies among the I/O
operations (e.g. in the dataset read example above it's better to read file A
quickly and then file B quickly than it is to read A and B slowly at the same
time). I've run some experiments (see comment) which show the OS scheduler
fails to achieve ideal performance in fairly typical cases.
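One way to see why reading files one after another can beat reading them all
at once is a toy model of completion times. This is only a sketch under
stated assumptions, not the experiment mentioned above: the device has a
fixed total throughput, all files are the same size, and the function names
and the seek_penalty parameter are made up for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Completion time of each of n files when read one at a time.
// t is the time to read one file at full device throughput.
std::vector<double> SequentialCompletion(int n, double t) {
  assert(n > 0);
  std::vector<double> done;
  for (int i = 1; i <= n; ++i) done.push_back(i * t);
  return done;
}

// Completion time of each of n files when all are read at once and share
// the device equally. seek_penalty is an assumed fractional slowdown from
// jumping between files (0.0 means no penalty).
std::vector<double> InterleavedCompletion(int n, double t,
                                          double seek_penalty = 0.0) {
  assert(n > 0);
  return std::vector<double>(n, n * t * (1.0 + seek_penalty));
}

// Average time at which a file becomes available to downstream work.
double MeanCompletion(const std::vector<double>& done) {
  return std::accumulate(done.begin(), done.end(), 0.0) / done.size();
}
```

In this model, two files that each take 1 second finish at 1s and 2s when
read sequentially (mean 1.5s) but both finish at 2s when interleaved (mean
2s), even with no seek penalty. Any penalty for random access only widens
the gap.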
--
This message was sent by Atlassian Jira
(v8.3.4#803005)