[
https://issues.apache.org/jira/browse/ARROW-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433871#comment-17433871
]
David Li commented on ARROW-14429:
----------------------------------
Thanks for the update.
As mentioned, the PR should speed things up somewhat as well. If you are
accessing random record batches in the file, that is about the best it can do,
since each batch still needs its own S3 request. However, if you are
sequentially scanning through all the record batches in the file, the record
batch generator in the PR is likely the fastest option - it reads from S3 in
larger chunks to avoid paying the request latency once per record batch. (It
does use more memory, though - please follow up if you find that to be an
issue.)
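For reference, a minimal sketch of the sequential access pattern this targets,
using the existing Python API (the bucket/key and region are placeholders; the
exact generator API from the PR may differ):
{code:python}
import pyarrow as pa
import pyarrow.fs as fs

# Placeholder bucket/key and region; adjust for your setup.
s3 = fs.S3FileSystem(region="us-east-1")

with s3.open_input_file("my-bucket/data.arrow") as f:
    reader = pa.ipc.open_file(f)
    # Sequential scan: visit every record batch in file order. This is
    # the access pattern the chunked reads in the PR are meant to help.
    for i in range(reader.num_record_batches):
        batch = reader.get_batch(i)
        # ... process batch ...
{code}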
> [C++] RecordBatchFileReader performance really bad in S3
> --------------------------------------------------------
>
> Key: ARROW-14429
> URL: https://issues.apache.org/jira/browse/ARROW-14429
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 5.0.0
> Reporter: Lingkai Kong
> Assignee: David Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 7.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> We are using RecordBatchFileWriter to write Arrow data directly to S3 via
> the S3FileSystem, then using RecordBatchFileReader to read it back. The write
> is quite efficient: writing a 50 MB file finishes within 0.2 s. But reading
> that file takes 30 s, which is definitely too long. So I ran several tests:
> # I used S3FileSystem to read the file into bytes, which took only 1 s. That
> makes me believe the issue is with RecordBatchFileReader.
> # At half the size (around 25 MB), reading with
> [RecordBatchFileReader|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchFileReader.html]
> took 17 s; without it, 0.28 s.
> # At double the size (around 100 MB), reading with
> [RecordBatchFileReader|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchFileReader.html]
> took 61 s; without it, 2.3 s.
> # I fetched all the bytes via S3FileSystem first, created a reader from those
> bytes, and then read all contents from the reader; that took only 0.1 s (see
> the sketch after this list).
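> A minimal sketch of the workaround in item 4, assuming pyarrow and
> placeholder bucket/key names (buffering the whole file avoids a separate S3
> round trip per record batch):
> {code:python}
> import pyarrow as pa
> import pyarrow.fs as fs
>
> # Placeholder bucket/key and region; adjust for your setup.
> s3 = fs.S3FileSystem(region="us-east-1")
>
> # Fetch the whole file in one streaming read, then parse in memory.
> with s3.open_input_stream("my-bucket/data.arrow") as stream:
>     data = stream.read()
>
> # BufferReader serves the already-downloaded bytes, so reading each
> # record batch needs no further S3 requests.
> reader = pa.ipc.open_file(pa.BufferReader(data))
> table = reader.read_all()
> {code}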