[ https://issues.apache.org/jira/browse/ARROW-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432970#comment-17432970 ]
David Li commented on ARROW-14429:
----------------------------------
[~lingkai2] the size of the record batches in the file is determined when you
write the file. How is the file being written?
Also I would note that if you are reading the entire file, and the file is
relatively small, it will be hard to do much better than just reading the
entire thing into memory first.
> [Python] RecordBatchFileReader performance really bad in S3
> -----------------------------------------------------------
>
> Key: ARROW-14429
> URL: https://issues.apache.org/jira/browse/ARROW-14429
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Reporter: Lingkai Kong
> Priority: Major
> Fix For: 7.0.0
>
>
> We are using RecordBatchFileWriter to write Arrow data directly to S3 using
> the S3FileSystem, then using RecordBatchFileReader to read it back from S3.
> The write is efficient: writing a 50MB file finishes within 0.2s. But reading
> that file takes 30s, which is definitely too long. I then ran several tests:
> # I used S3FileSystem to read the file into bytes, which takes only 1s. This
> makes me believe the issue is with RecordBatchFileReader.
> # Half the size (around 25MB), with
> [RecordBatchFileReader|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchFileReader.html]
> took 17s, without
> [RecordBatchFileReader|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchFileReader.html]
> took 0.28s
> # Double the size (around 100MB), with
> [RecordBatchFileReader|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchFileReader.html]
> took 61s, without
> [RecordBatchFileReader|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchFileReader.html]
> took 2.3s
> # I fetched all the bytes using S3FileSystem first, then created a reader
> from the bytes. Reading all the content from that reader takes only 0.1s.