[jira] [Created] (ARROW-11583) [C++] Filesystem aware disk scheduling

Weston Pace (Jira) Tue, 09 Feb 2021 20:15:07 -0800

Weston Pace created ARROW-11583:
-----------------------------------

             Summary: [C++] Filesystem aware disk scheduling
                 Key: ARROW-11583
                 URL: https://issues.apache.org/jira/browse/ARROW-11583
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace



Different filesystems have different ideal access strategies.  For example:

AWS: Unlimited parallelism?, no penalty for random access?
AWS EBS: Depends
SSD: Bounded parallelism (number of hardware contexts), penalty for random access within a context
HDD: Very limited parallelism (usually 1), penalty for random access

Currently, Arrow does not factor these access strategies into I/O scheduling.
For example, when reading a dataset of 100 files, it will start reading X
files at once (where X is the parallelism of the thread pool).  For AWS this is
ideal.  For an HDD it is not.
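
To make the idea concrete, here is a minimal sketch of a per-filesystem limit
on concurrent reads.  Nothing in it is an existing Arrow API: StorageKind,
MaxConcurrentReads and ReadGate are hypothetical names, and the numeric limits
(in particular the 8 assumed SSD hardware contexts) are placeholders rather
than measured values.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical classification of storage backends; not an Arrow type.
enum class StorageKind { kObjectStore, kSsd, kHdd };

// Illustrative limits only.  A real implementation would measure these
// or make them configurable.
inline std::size_t MaxConcurrentReads(StorageKind kind, std::size_t pool_size) {
  assert(pool_size > 0);
  switch (kind) {
    case StorageKind::kObjectStore:
      return pool_size;  // assumption: no limit beyond the thread pool
    case StorageKind::kSsd:
      return std::min<std::size_t>(pool_size, 8);  // assumed hw context count
    case StorageKind::kHdd:
      return 1;  // one stream at a time to avoid seeking between files
  }
  return 1;
}

// Counting gate: a reader blocks in Acquire() until a slot is free.
class ReadGate {
 public:
  explicit ReadGate(std::size_t limit) : free_(limit) {}

  void Acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return free_ > 0; });
    --free_;
  }

  void Release() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      ++free_;
    }
    cv_.notify_one();
  }

  std::size_t free_slots() const {
    std::lock_guard<std::mutex> lock(mu_);
    return free_;
  }

 private:
  mutable std::mutex mu_;
  std::condition_variable cv_;
  std::size_t free_;
};
```

Each file read would call Acquire() before issuing I/O and Release() when it
finishes, so an HDD-backed dataset is read one file at a time even when the
thread pool is larger.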

The OS does have a scheduler which attempts to mitigate this.  However, it does
not know the scope of the I/O or the dependencies among the I/O requests (e.g.
in the above dataset read example it's better to read one file quickly and then
the next quickly than it is to read both slowly at the same time).  I've run
some experiments (see comment) which show the OS scheduler fails to achieve
ideal performance in fairly typical cases.
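
One way to read the parenthetical above is as simple arithmetic on completion
times.  The sketch below assumes a deliberately simplified model: a device
with fixed throughput, equal-size reads, and no seek penalty (a penalty would
only make the concurrent case worse).  The function names are illustrative.

```cpp
#include <cassert>
#include <vector>

// Completion times when n equal reads, each taking t alone, run back to back.
std::vector<double> SequentialCompletion(int n, double t) {
  std::vector<double> done;
  for (int i = 1; i <= n; ++i) done.push_back(i * t);
  return done;
}

// Completion times when the same n reads share the device evenly:
// every read progresses at 1/n speed, so all of them finish at n * t.
std::vector<double> SharedCompletion(int n, double t) {
  return std::vector<double>(n, n * t);
}

// Average completion time across all reads.
double Mean(const std::vector<double>& v) {
  assert(!v.empty());
  double sum = 0.0;
  for (double x : v) sum += x;
  return sum / v.size();
}
```

With two reads of duration t, the last read finishes at 2t either way, but the
mean completion time is 1.5t when they run one after another and 2t when they
share the device, so downstream work on the first file can start sooner.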



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
