[
https://issues.apache.org/jira/browse/ARROW-17481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoine Pitrou updated ARROW-17481:
-----------------------------------
Fix Version/s: 11.0.0
> [C++] [Python] Major performance improvements to CSV reading from S3
> --------------------------------------------------------------------
>
> Key: ARROW-17481
> URL: https://issues.apache.org/jira/browse/ARROW-17481
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Ziheng Wang
> Assignee: Ziheng Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 11.0.0
>
> Time Spent: 6h 40m
> Remaining Estimate: 0h
>
> The current dataset reader for CSV is pretty slow on EC2 reading from S3.
> EC2 instances have more than 3Gbps network bandwidth which make them on par
> with SSD. However reading batches from disk is more than 3x faster than
> reading from network. This should not happen.
> The reason why the dataset reader is not fully leveraging the network
> bandwidth is because reads are currently serial. We should change the reads
> to be parallel. Then even if the rest of the pipeline is not parallel we
> should get same read speed as disk.
> Note one might think that if you have many fragments fragment-level
> parallelism will take care of this. This is true to some extent however
> to_batches() is ordered. This means that if your fragments are big the
> fragment readahead will stop being effective after a while as the reader
> tries to deplete the fragments in order. The batch readahead for the CSV
> reader current is a serial readahead, which really should be a parallel
> readahead.
> After changing the network IO to be parallel, we should also change the parse
> and decode to be parallel. It's easy to change the parse to be parallel, a
> bit harder for the decode because of how the decoder operator works, so I
> will just tackle the parse first.
> On my test system (i3.2xlarge on EC2 reading from S3 one large CSV), these
> changes (parallel reading and parallel parsing) made reading 60 batches
> (~10GB) 4x faster. Note these changes will also make disk reading faster due
> to parallel parse.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)