[jira] [Closed] (ARROW-7244) [Python] Inconsistent behavior with reading in S3 parquet objects

Antoine Pitrou (Jira) Thu, 05 Aug 2021 09:09:23 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Antoine Pitrou closed ARROW-7244.
---------------------------------
    Resolution: Cannot Reproduce

We are using our own S3 filesystem implementation now, so I'm assuming this 
issue no longer exists.

If you still encounter it (or a similar one) with a recent version of Arrow, 
please open a new JIRA!
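
For anyone retrying on a recent release, here is a minimal sketch of reading
through Arrow's built-in S3 filesystem instead of s3fs. It assumes the
implementation referred to above is pyarrow.fs.S3FileSystem, which takes
"bucket/key" paths without the s3:// scheme. The helper names
s3_url_to_arrow_path and read_parquet_from_s3 are hypothetical, chosen only
for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def s3_url_to_arrow_path(url: str) -> str:
    """Turn 's3://bucket/key' into 'bucket/key' (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    # Drop the scheme and any trailing slash; keep bucket and key.
    return f"{parsed.netloc}/{parsed.path.lstrip('/')}".rstrip("/")


def read_parquet_from_s3(url: str, region: Optional[str] = None):
    """Read a Parquet file or directory from S3 into a pyarrow Table.

    Assumes credentials are available from the environment
    (for example the Lambda execution role).
    """
    # Imported lazily so the path helper works without pyarrow installed.
    import pyarrow.parquet as pq
    from pyarrow import fs

    s3 = fs.S3FileSystem(region=region)
    return pq.read_table(s3_url_to_arrow_path(url), filesystem=s3)
```

A call would look like read_parquet_from_s3("s3://my-bucket/path/to/file.parquet"),
where the bucket and key are placeholders.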

> [Python] Inconsistent behavior with reading in S3 parquet objects
> -----------------------------------------------------------------
>
>                 Key: ARROW-7244
>                 URL: https://issues.apache.org/jira/browse/ARROW-7244
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: running in a lambda, compiled on an EC2 using linux
>            Reporter: William Tardio
>            Priority: Major
>
> We are piloting pyarrow to read parquet files from AWS S3.
>  
> We got it working in combination with s3fs as the filesystem. However, we are 
> seeing very inconsistent results when reading in parquet objects with:
> s3 = s3fs.S3FileSystem()
> ParquetDataset(url, filesystem=s3)
>  
> The read inconsistently throws this error:
>  
> [ERROR] OSError: Passed non-file path: s3://bucket/schedule/sxaup/fms_db_aub/adn_master/trunc/20191122024436.parquet
> Traceback (most recent call last):
>   File "/var/task/file_check.py", line 35, in lambda_handler
>     main(event, context)
>   File "/var/task/file_check.py", line 260, in main
>     validate_resp['object_type'])
>   File "/opt/python/utils.py", line 80, in schema_check
>     stage_pya_dataset = ParquetDataset(full_URL_stage, filesystem=s3)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1030, in __init__
>     open_file_func=partial(_open_dataset_file, self._metadata)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1229, in _make_manifest
>     .format(path))
>  
> As you can see, the path is valid; it sometimes works and other times does 
> not (the file was not modified between the successful and failing runs). 
> Does ParquetDataset actually open the file and validate it, meaning the 
> error relates to the data?
>  
> Willing to do any troubleshooting to get this solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
