[ https://issues.apache.org/jira/browse/ARROW-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980486#comment-16980486 ]
Wes McKinney commented on ARROW-7244:
-------------------------------------
I can't comment directly on the s3fs interop question. Our general plan moving
forward is to use our native C++ S3 implementation, but as far as I'm aware it
is not yet ready for production.

Please note that this library is not yet optimized for Amazon S3, so please do
not draw any conclusions about performance or scalability from your
experiments. For example, the new behavior discussed in PARQUET-1698 will make
a big difference for S3 users.

cc [~fsaintjacques] [~apitrou]
> Inconsistent behavior with reading in S3 parquet objects
> --------------------------------------------------------
>
> Key: ARROW-7244
> URL: https://issues.apache.org/jira/browse/ARROW-7244
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Environment: running in a lambda, compiled on an EC2 using linux
> Reporter: William Tardio
> Priority: Blocker
>
> We are piloting pyarrow for reading Parquet files from AWS S3.
>
> We got it working in combination with s3fs as the filesystem. However, we are
> seeing very inconsistent results when reading in Parquet objects with:
>
> import s3fs
> from pyarrow.parquet import ParquetDataset
>
> s3 = s3fs.S3FileSystem()
> ParquetDataset(url, filesystem=s3)
>
> The read inconsistently throws this error:
>
> [ERROR] OSError: Passed non-file path: s3://bucket/schedule/sxaup/fms_db_aub/adn_master/trunc/20191122024436.parquet
> Traceback (most recent call last):
>   File "/var/task/file_check.py", line 35, in lambda_handler
>     main(event, context)
>   File "/var/task/file_check.py", line 260, in main
>     validate_resp['object_type'])
>   File "/opt/python/utils.py", line 80, in schema_check
>     stage_pya_dataset = ParquetDataset(full_URL_stage, filesystem=s3)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1030, in __init__
>     open_file_func=partial(_open_dataset_file, self._metadata)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1229, in _make_manifest
>     .format(path))
>
> As you can see, the path is valid: the read sometimes works and other times
> does not, with no modification of the file between the successful and failing
> runs. Does ParquetDataset actually open and validate the file, so that the
> error relates to the data rather than the path?
>
> Willing to do any troubleshooting to get this solved.
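
One possible troubleshooting step, offered as a rough sketch and not as a
confirmed diagnosis: the traceback ends in _make_manifest, so the error appears
to come from a path check and not from parsing the Parquet data. It may help to
record what the filesystem object itself reports for the same path across a few
fresh lookups. The helper name probe_path, its parameters, and the placeholder
path are all hypothetical. The sketch assumes the filesystem exposes exists(),
isfile() and isdir(), and optionally invalidate_cache(), as fsspec-based s3fs
releases do.

```python
import time


def probe_path(fs, path, attempts=3, delay=0.5):
    """Record what `fs` reports for `path` on several consecutive lookups.

    Hypothetical diagnostic helper; `fs` is any object with exists(),
    isfile() and isdir() methods (assumed here, e.g. s3fs.S3FileSystem).
    """
    observations = []
    for attempt in range(attempts):
        # If the filesystem keeps a listing cache, drop it so that each
        # attempt queries the store again instead of reusing old results.
        invalidate = getattr(fs, "invalidate_cache", None)
        if callable(invalidate):
            invalidate()
        observations.append({
            "attempt": attempt + 1,
            "exists": fs.exists(path),
            "isfile": fs.isfile(path),
            "isdir": fs.isdir(path),
        })
        # Pause between attempts, but not after the last one.
        if attempt + 1 < attempts:
            time.sleep(delay)
    return observations


# Example usage (placeholder path, requires s3fs and AWS credentials):
#   import s3fs
#   s3 = s3fs.S3FileSystem()
#   for row in probe_path(s3, "s3://bucket/path/to/file.parquet"):
#       print(row)
```

If isfile flips between True and False for an unchanged object, that would
point toward the listing or caching layer and away from the file contents.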
--
This message was sent by Atlassian Jira
(v8.3.4#803005)