[ https://issues.apache.org/jira/browse/ARROW-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou closed ARROW-7244.
---------------------------------
    Resolution: Cannot Reproduce

We are using our own S3 filesystem implementation now, so I'm assuming this issue no longer exists. If you still encounter it (or a similar one) with a recent version of Arrow, please open a new JIRA!

> [Python] Inconsistent behavior with reading in S3 parquet objects
> -----------------------------------------------------------------
>
>                 Key: ARROW-7244
>                 URL: https://issues.apache.org/jira/browse/ARROW-7244
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: running in a lambda, compiled on an EC2 using linux
>            Reporter: William Tardio
>            Priority: Major
>
> We are piloting using pyarrow to read parquet files from AWS S3.
>
> We got it working in combination with s3fs as the filesystem. However, we are
> seeing very inconsistent results when reading in parquet objects with
>
> s3=s3fs.S3FileSystem()
> ParquetDataset(url, filesystem=s3)
>
> The read inconsistently throws this error:
>
> [ERROR] OSError: Passed non-file path: s3://bucket/schedule/sxaup/fms_db_aub/adn_master/trunc/20191122024436.parquet
> Traceback (most recent call last):
>   File "/var/task/file_check.py", line 35, in lambda_handler
>     main(event, context)
>   File "/var/task/file_check.py", line 260, in main
>     validate_resp['object_type'])
>   File "/opt/python/utils.py", line 80, in schema_check
>     stage_pya_dataset = ParquetDataset(full_URL_stage, filesystem=s3)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1030, in __init__
>     open_file_func=partial(_open_dataset_file, self._metadata)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1229, in _make_manifest
>     .format(path))
>
> As you can see, the path is valid, and the read sometimes works and other times
> does not (the file is not modified between the successful and failing runs).
> Does ParquetDataset actually open the file and validate it, meaning the error
> relates to the data?
>
> Willing to do any troubleshooting to get this solved.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
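
For anyone who still sees an intermittent "Passed non-file path" error, here is a minimal troubleshooting sketch in Python. It was not part of the original report, and the cause of this issue was never confirmed. The sketch assumes the failure comes from the filesystem object briefly reporting the key as neither a file nor a directory, for example because of a stale listing cache. The helper name wait_until_visible and its parameters are hypothetical. It relies only on the isfile(), isdir() and invalidate_cache() methods that s3fs.S3FileSystem inherits from fsspec.

```python
import time


def wait_until_visible(fs, path, attempts=3, delay=1.0):
    """Return True once `fs` reports `path` as a file or a directory.

    `fs` is any filesystem object exposing isfile(), isdir() and
    invalidate_cache(), such as s3fs.S3FileSystem. This is a hypothetical
    helper for troubleshooting, not part of pyarrow or s3fs.
    """
    for attempt in range(attempts):
        # Drop cached listings so the next check queries the store again.
        fs.invalidate_cache()
        if fs.isfile(path) or fs.isdir(path):
            return True
        # Sleep only between attempts, not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

In the reporter's setup this could be called as wait_until_visible(s3, url) right before ParquetDataset(url, filesystem=s3). If it returns False even after several attempts, the path probably is not visible to the filesystem at all, which points away from the parquet data itself.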