[ 
https://issues.apache.org/jira/browse/ARROW-17481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17481:
-----------------------------------
    Fix Version/s: 11.0.0

> [C++] [Python] Major performance improvements to CSV reading from S3
> --------------------------------------------------------------------
>
>                 Key: ARROW-17481
>                 URL: https://issues.apache.org/jira/browse/ARROW-17481
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 11.0.0
>
>          Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> The current dataset reader for CSV is pretty slow on EC2 reading from S3.
> EC2 instances have more than 3Gbps network bandwidth which make them on par 
> with SSD. However reading batches from disk is more than 3x faster than 
> reading from network. This should not happen.
> The reason why the dataset reader is not fully leveraging the network 
> bandwidth is because reads are currently serial. We should change the reads 
> to be parallel. Then even if the rest of the pipeline is not parallel we 
> should get same read speed as disk.
> Note one might think that if you have many fragments fragment-level 
> parallelism will take care of this. This is true to some extent however 
> to_batches() is ordered. This means that if your fragments are big the 
> fragment readahead will stop being effective after a while as the reader 
> tries to deplete the fragments in order. The batch readahead for the CSV 
> reader current is a serial readahead, which really should be a parallel 
> readahead.
> After changing the network IO to be parallel, we should also change the parse 
> and decode to be parallel. It's easy to change the parse to be parallel, a 
> bit harder for the decode because of how the decoder operator works, so I 
> will just tackle the parse first.
> On my test system (i3.2xlarge on EC2 reading from S3 one large CSV), these 
> changes (parallel reading and parallel parsing) made reading 60 batches 
> (~10GB) 4x faster. Note these changes will also make disk reading faster due 
> to parallel parse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to