[jira] [Updated] (ARROW-18076) [Python] PyArrow cannot read from R2 (Cloudflare's S3)

Kouhei Sutou (Jira) Mon, 17 Oct 2022 14:24:03 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kouhei Sutou updated ARROW-18076:
---------------------------------
    Component/s: Python

> [Python] PyArrow cannot read from R2 (Cloudflare's S3)
> ------------------------------------------------------
>
>                 Key: ARROW-18076
>                 URL: https://issues.apache.org/jira/browse/ARROW-18076
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: Ubuntu 20
>            Reporter: Vedant Roy
>            Priority: Major
>
> When using pyarrow to read Parquet data (as part of the Ray project), I get 
> the following stack trace:
> ```
> (_sample_piece pid=49818) Traceback (most recent call last):
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in ray._raylet.execute_task
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in ray._raylet.execute_task
> (_sample_piece pid=49818)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 446, in _sample_piece
> (_sample_piece pid=49818)     batch = next(batches)
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in _iterator
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> (_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, Transferred a partial file
> ```
> I do not get this error when using Amazon S3 for the exact same data.
> The error is coming from this line:
> https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
