Ziheng Wang created ARROW-17481:
-----------------------------------

             Summary: Major performance improvements to CSV reading from S3
                 Key: ARROW-17481
                 URL: https://issues.apache.org/jira/browse/ARROW-17481
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Ziheng Wang
            Assignee: Ziheng Wang


The current dataset reader for CSV is pretty slow when reading from S3 on EC2.

EC2 instances have more than 3 Gbps of network bandwidth, which puts them 
roughly on par with SSD read throughput. However, reading batches from local 
disk is currently more than 3x faster than reading from the network. This 
should not happen.
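To make the gap concrete, a minimal sketch along these lines can reproduce the 
comparison. The bucket, region, and local path below are hypothetical 
placeholders; the point is simply to time draining the same file from each 
source:

{code:python}
# Time a full scan of the same CSV from S3 and from local disk.
# Bucket, region, and paths are placeholders.
import time

import pyarrow.dataset as ds
from pyarrow import fs


def time_scan(dataset):
    # Drain the dataset batch by batch and report throughput.
    start = time.monotonic()
    nbytes = 0
    for batch in dataset.to_batches():
        nbytes += batch.nbytes
    elapsed = time.monotonic() - start
    print(f"{nbytes / 1e9:.2f} GB in {elapsed:.1f} s")


s3 = fs.S3FileSystem(region="us-east-1")
time_scan(ds.dataset("my-bucket/big.csv", format="csv", filesystem=s3))
time_scan(ds.dataset("/mnt/nvme/big.csv", format="csv"))
{code}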

The reason the dataset reader is not fully leveraging the network bandwidth 
is that reads are currently serial. We should change the reads to be 
parallel. Then, even if the rest of the pipeline is not parallel, we should 
get the same read speed from the network as from disk.
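As a rough user-level illustration of the idea (not the internal reader 
change itself), byte-range reads against S3 can be issued concurrently 
instead of back to back; bucket, path, chunk size, and thread count below 
are placeholders:

{code:python}
# Issue S3 range reads in parallel instead of serially.
from concurrent.futures import ThreadPoolExecutor

from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")
f = s3.open_input_file("my-bucket/big.csv")
size = f.size()
chunk = 32 * 1024 * 1024  # 32 MiB per range request

# Each read_at() call is an independent positional read (a range GET for
# S3), so the requests below can be in flight concurrently.
offsets = range(0, size, chunk)
with ThreadPoolExecutor(max_workers=16) as pool:
    buffers = list(pool.map(lambda off: f.read_at(chunk, off), offsets))
{code}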

Note that one might think that, with many fragments, fragment-level 
parallelism will take care of this. That is true to some extent; however, 
to_batches() is ordered. This means that if your fragments are big, the 
fragment readahead stops being effective after a while, since the reader has 
to deplete the fragments in order. The batch readahead for the CSV reader is 
currently a serial readahead, when it really should be a parallel readahead.
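For reference, this is the ordered consumption pattern in question (paths are 
placeholders). Batches come back strictly in fragment order, so once the 
readahead buffer for a later fragment fills, reads for that fragment stall 
until the earlier fragments are fully drained:

{code:python}
# to_batches() yields batches in fragment order.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")
dataset = ds.dataset(
    ["my-bucket/part-0.csv", "my-bucket/part-1.csv"],
    format="csv",
    filesystem=s3,
)
for batch in dataset.to_batches():
    _ = batch.num_rows  # batches are consumed strictly in order
{code}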

After changing the network IO to be parallel, we should also change the parse 
and decode steps to be parallel. Parallelizing the parse is easy; the decode 
is harder because of how the decoder operator works, so I will tackle the 
parse first.
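A user-level sketch of the parallel-parse idea follows: cut a headerless 
buffer on row boundaries and parse the pieces concurrently. read_csv releases 
the GIL while parsing, so the threads genuinely overlap. This approximates 
the concept, not the internal implementation; the split heuristic, chunk 
count, and column names are illustrative:

{code:python}
# Parse one large headerless CSV buffer with several threads.
import io
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.csv as csv


def split_on_newlines(buf, n_chunks):
    # Cut into roughly equal pieces, moving each cut forward to the next
    # newline so no row is split across chunks.
    bounds = [0]
    step = len(buf) // n_chunks
    for i in range(1, n_chunks):
        cut = buf.find(b"\n", i * step)
        if cut == -1:
            break
        if cut + 1 > bounds[-1]:
            bounds.append(cut + 1)
    if bounds[-1] != len(buf):
        bounds.append(len(buf))
    return [buf[a:b] for a, b in zip(bounds, bounds[1:])]


def parallel_parse(buf, column_names, n_chunks=8):
    # Chunks carry no header row, so pass the column names explicitly.
    opts = csv.ReadOptions(column_names=column_names)
    parse = lambda c: csv.read_csv(io.BytesIO(c), read_options=opts)
    chunks = split_on_newlines(buf, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        tables = list(pool.map(parse, chunks))
    return pa.concat_tables(tables)
{code}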

On my test system (an i3.2xlarge on EC2 reading one large CSV from S3), these 
changes (parallel reads and parallel parsing) made reading 60 batches 
(~10 GB) 4x faster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
