[
https://issues.apache.org/jira/browse/ARROW-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091841#comment-17091841
]
Harini Kannan commented on ARROW-7244:
--------------------------------------
Is there any update on this? I'm seeing the same error intermittently in a
Lambda function that is triggered by new Parquet files in an S3 bucket and
reads them using ParquetDataset(). Is there a workaround?
> [Python] Inconsistent behavior with reading in S3 parquet objects
> -----------------------------------------------------------------
>
> Key: ARROW-7244
> URL: https://issues.apache.org/jira/browse/ARROW-7244
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Environment: running in a lambda, compiled on an EC2 using linux
> Reporter: William Tardio
> Priority: Major
>
> We are piloting the use of pyarrow to read Parquet files from AWS S3.
>
> We got it working in combination with s3fs as the filesystem. However, we are
> seeing very inconsistent results when reading in Parquet objects with:
>
>     s3 = s3fs.S3FileSystem()
>     ParquetDataset(url, filesystem=s3)
>
> The read inconsistently throws this error:
>
> [ERROR] OSError: Passed non-file path: s3://bucket/schedule/sxaup/fms_db_aub/adn_master/trunc/20191122024436.parquet
> Traceback (most recent call last):
>   File "/var/task/file_check.py", line 35, in lambda_handler
>     main(event, context)
>   File "/var/task/file_check.py", line 260, in main
>     validate_resp['object_type'])
>   File "/opt/python/utils.py", line 80, in schema_check
>     stage_pya_dataset = ParquetDataset(full_URL_stage, filesystem=s3)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1030, in __init__
>     open_file_func=partial(_open_dataset_file, self._metadata)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1229, in _make_manifest
>     .format(path))
>
> As you can see, the path is valid; it sometimes works and other times does
> not (the file is not modified between the successful and failing runs). Does
> ParquetDataset actually open the file and validate it, meaning the error
> relates to the data?
>
> Willing to do any troubleshooting to get this solved.
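
A possible stopgap, offered as a sketch rather than a confirmed fix: the
traceback ends in _make_manifest, which appears to reject the path when the
filesystem's isfile() check returns False, so the failure may come from the
path lookup rather than from the Parquet data itself. Assuming (unverified
here) that a stale s3fs directory-listing cache is what makes isfile() flaky,
the helper below clears the cache and retries the check before the dataset is
opened. The name wait_until_file and its parameters are hypothetical;
invalidate_cache() and isfile() are the fsspec-style methods that
s3fs.S3FileSystem exposes.

```python
import time


def wait_until_file(fs, path, attempts=3, delay=0.5):
    """Return True once fs.isfile(path) succeeds, retrying a few times.

    `fs` is any fsspec-style filesystem (for example s3fs.S3FileSystem).
    This helper is a hypothetical workaround, not part of pyarrow or s3fs.
    """
    for attempt in range(attempts):
        # Drop cached directory listings so the next check asks the backend again.
        fs.invalidate_cache()
        if fs.isfile(path):
            return True
        # Sleep only between attempts, not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Hypothetical usage in the Lambda handler (needs s3fs and pyarrow installed):
#   import s3fs
#   from pyarrow.parquet import ParquetDataset
#   s3 = s3fs.S3FileSystem()
#   if wait_until_file(s3, url):
#       dataset = ParquetDataset(url, filesystem=s3)
```

If the check still fails after the retries, logging the result of
s3.isfile(url) and s3.exists(url) alongside the error would help narrow down
whether the listing or the object lookup is at fault.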