[ https://issues.apache.org/jira/browse/ARROW-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623138#comment-17623138 ]
Alessandro Molina commented on ARROW-18076:
-------------------------------------------
Have you tried reaching out to Cloudflare to verify whether it might be a
problem with the file itself? That error is usually caused by a mismatch
between the {{Content-Length}} header and the number of bytes actually
transferred. In the majority of cases the problem is the server setting a
wrong {{Content-Length}} or truncating the connection. So I would check with
Cloudflare support, especially since you say that the same file works
correctly when using S3.
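A quick way to verify this independently of PyArrow is to download the object
over plain HTTP and compare the declared {{Content-Length}} with the number of
bytes actually received. This is only a sketch; the URL below is a placeholder
for a presigned link to the affected parquet object.

{code:python}
import requests

# Placeholder: use a presigned URL for the parquet object that fails.
url = "https://<account-id>.r2.cloudflarestorage.com/<bucket>/<key>"

resp = requests.get(url, stream=True)
declared = int(resp.headers.get("Content-Length", -1))

received = 0
for chunk in resp.iter_content(chunk_size=1 << 20):
    received += len(chunk)

print("Content-Length header:", declared)
print("Bytes actually received:", received)
# A mismatch here reproduces "curlCode: 18, Transferred a partial file"
# at the HTTP level, independently of the S3 client used by Arrow.
{code}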
I'm going to close the ticket. If you get an answer from Cloudflare confirming
that everything is fine on their side, feel free to reopen it.
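As a side note, it may also help to take Ray out of the picture and read the
same file with PyArrow's S3 filesystem pointed directly at the R2 endpoint.
A rough sketch, where the credentials, endpoint and object path are
placeholders to be filled in:

{code:python}
import pyarrow.parquet as pq
from pyarrow import fs

# Placeholders: substitute your R2 account id, credentials and object path.
r2 = fs.S3FileSystem(
    access_key="<r2-access-key>",
    secret_key="<r2-secret-key>",
    endpoint_override="https://<account-id>.r2.cloudflarestorage.com",
    region="auto",
)

# If this also fails with "curlCode: 18, Transferred a partial file",
# the problem sits between the S3 client and R2, not in the Ray layer.
table = pq.read_table("<bucket>/path/to/file.parquet", filesystem=r2)
print(table.num_rows)
{code}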
> [Python] PyArrow cannot read from R2 (Cloudflare's S3)
> ------------------------------------------------------
>
> Key: ARROW-18076
> URL: https://issues.apache.org/jira/browse/ARROW-18076
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Environment: Ubuntu 20
> Reporter: Vedant Roy
> Priority: Major
>
> When using pyarrow to read parquet data (as part of the Ray project), I get
> the following stack trace:
> {noformat}
> (_sample_piece pid=49818) Traceback (most recent call last):
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in ray._raylet.execute_task
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in ray._raylet.execute_task
> (_sample_piece pid=49818)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 446, in _sample_piece
> (_sample_piece pid=49818)     batch = next(batches)
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in _iterator
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> (_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, Transferred a partial file
> {noformat}
> I do not get this error when using Amazon S3 for the exact same data.
> The error is coming from this line:
> https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446