[ 
https://issues.apache.org/jira/browse/ARROW-17481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziheng Wang updated ARROW-17481:
--------------------------------
    Description: 
The current dataset reader for CSV is pretty slow on EC2 reading from S3.

EC2 instances have more than 3Gbps network bandwidth which make them on par 
with SSD. However reading batches from disk is more than 3x faster than reading 
from network. This should not happen.

The reason why the dataset reader is not fully leveraging the network bandwidth 
is because reads are currently serial. We should change the reads to be 
parallel. Then even if the rest of the pipeline is not parallel we should get 
same read speed as disk.

Note one might think that if you have many fragments fragment-level parallelism 
will take care of this. This is true to some extent however to_batches() is 
ordered. This means that if your fragments are big the fragment readahead will 
stop being effective after a while as the reader tries to deplete the fragments 
in order. The batch readahead for the CSV reader current is a serial readahead, 
which really should be a parallel readahead.

After changing the network IO to be parallel, we should also change the parse 
and decode to be parallel. It's easy to change the parse to be parallel, a bit 
harder for the decode because of how the decoder operator works, so I will just 
tackle the parse first.

On my test system (i3.2xlarge on EC2 reading from S3 one large CSV), these 
changes (parallel reading and parallel parsing) made reading 60 batches (~10GB) 
4x faster. Note these changes will also make disk reading faster due to 
parallel parse.

  was:
The current dataset reader for CSV is pretty slow on EC2 reading from S3.

EC2 instances have more than 3Gbps network bandwidth which make them on par 
with SSD. However reading batches from disk is more than 3x faster than reading 
from network. This should not happen.

The reason why the dataset reader is not fully leveraging the network bandwidth 
is because reads are currently serial. We should change the reads to be 
parallel. Then even if the rest of the pipeline is not parallel we should get 
same read speed as disk.

Note one might think that if you have many fragments fragment-level parallelism 
will take care of this. This is true to some extent however to_batches() is 
ordered. This means that if your fragments are big the fragment readahead will 
stop being effective after a while as the reader tries to deplete the fragments 
in order. The batch readahead for the CSV reader current is a serial readahead, 
which really should be a parallel readahead.

After changing the network IO to be parallel, we should also change the parse 
and decode to be parallel. It's easy to change the parse to be parallel, a bit 
harder for the decode because of how the decoder operator works, so I will just 
tackle the parse first.

On my test system (i3.2xlarge on EC2 reading from S3 one large CSV), these 
changes (parallel ) made reading 60 batches (~10GB) 4x faster.


> Major performance improvements to CSV reading from S3
> -----------------------------------------------------
>
>                 Key: ARROW-17481
>                 URL: https://issues.apache.org/jira/browse/ARROW-17481
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> The current dataset reader for CSV is pretty slow on EC2 reading from S3.
> EC2 instances have more than 3Gbps network bandwidth which make them on par 
> with SSD. However reading batches from disk is more than 3x faster than 
> reading from network. This should not happen.
> The reason why the dataset reader is not fully leveraging the network 
> bandwidth is because reads are currently serial. We should change the reads 
> to be parallel. Then even if the rest of the pipeline is not parallel we 
> should get same read speed as disk.
> Note one might think that if you have many fragments fragment-level 
> parallelism will take care of this. This is true to some extent however 
> to_batches() is ordered. This means that if your fragments are big the 
> fragment readahead will stop being effective after a while as the reader 
> tries to deplete the fragments in order. The batch readahead for the CSV 
> reader current is a serial readahead, which really should be a parallel 
> readahead.
> After changing the network IO to be parallel, we should also change the parse 
> and decode to be parallel. It's easy to change the parse to be parallel, a 
> bit harder for the decode because of how the decoder operator works, so I 
> will just tackle the parse first.
> On my test system (i3.2xlarge on EC2 reading from S3 one large CSV), these 
> changes (parallel reading and parallel parsing) made reading 60 batches 
> (~10GB) 4x faster. Note these changes will also make disk reading faster due 
> to parallel parse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to