[jira] [Commented] (ARROW-7244) Inconsistent behavior with reading in S3 parquet objects

Wes McKinney (Jira) Fri, 22 Nov 2019 12:20:30 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980486#comment-16980486
 ]


Wes McKinney commented on ARROW-7244:
-------------------------------------

I can't comment directly on the s3fs interop question. Our general plan moving
forward is to use our native C++ S3 implementation, but as far as I'm aware
this is not ready for production yet.

Please note that this library is not at all optimized (yet) for Amazon S3, so
please do not draw any conclusions regarding performance or scalability based
on your experiments. For example, the new behavior discussed in PARQUET-1698
will make a big difference for S3 users.

cc [~fsaintjacques] [~apitrou]

> Inconsistent behavior with reading in S3 parquet objects
> --------------------------------------------------------
>
>                 Key: ARROW-7244
>                 URL: https://issues.apache.org/jira/browse/ARROW-7244
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: running in a lambda, compiled on an EC2 using linux
>            Reporter: William Tardio
>            Priority: Blocker
>
> We are piloting pyarrow for reading Parquet files from AWS S3.
>  
> We got it working in combination with s3fs as the filesystem. However, we are 
> seeing very inconsistent results when reading in parquet objects with:
> s3=s3fs.S3FileSystem()
> ParquetDataset(url, filesystem=s3)
>  
> The read inconsistently throws this error:
>  
> [ERROR] OSError: Passed non-file path: 
> s3://bucket/schedule/sxaup/fms_db_aub/adn_master/trunc/20191122024436.parquet
> Traceback (most recent call last):
>   File "/var/task/file_check.py", line 35, in lambda_handler
>     main(event, context)
>   File "/var/task/file_check.py", line 260, in main
>     validate_resp['object_type'])
>   File "/opt/python/utils.py", line 80, in schema_check
>     stage_pya_dataset = ParquetDataset(full_URL_stage, filesystem=s3)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1030, in __init__
>     open_file_func=partial(_open_dataset_file, self._metadata)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1229, in _make_manifest
>     .format(path))
>  
> As you can see, the path is valid and sometimes works, other times it does
> not (the file was not modified between the successful and failing runs). Does
> ParquetDataset actually open the file and validate it, so that the error
> relates to the data?
>  
> Willing to do any troubleshooting to get this solved.
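
One possible troubleshooting step, sketched under assumptions rather than as a
confirmed diagnosis: the traceback shows the error being raised from
_make_manifest, before any Parquet data is read, which suggests the filesystem
object did not report the path as a file at that moment. The helper below
records what a filesystem reports for a path, once as-is and once after
dropping any cached listings. The names describe_path and describe_path_fresh
are hypothetical. The code assumes only that the filesystem object exposes
exists, isfile and isdir, plus optionally invalidate_cache, as fsspec-style
filesystems such as s3fs do.

```python
def describe_path(fs, path):
    """Record how a filesystem object classifies `path`.

    `fs` is assumed to expose exists/isfile/isdir (fsspec-style).
    Exceptions are captured as strings so that a failing call is
    kept as evidence instead of aborting the check.
    """
    report = {"path": path}
    for name in ("exists", "isfile", "isdir"):
        try:
            report[name] = getattr(fs, name)(path)
        except Exception as exc:
            report[name] = repr(exc)
    return report


def describe_path_fresh(fs, path):
    """Same as describe_path, after clearing cached listings if possible."""
    # invalidate_cache is assumed to exist on fsspec-style filesystems;
    # it is skipped silently when the object does not provide it.
    invalidate = getattr(fs, "invalidate_cache", None)
    if callable(invalidate):
        invalidate()
    return describe_path(fs, path)


# Hypothetical usage inside the Lambda, just before ParquetDataset(...):
#   import s3fs
#   s3 = s3fs.S3FileSystem()
#   print(describe_path(s3, full_URL_stage))
#   print(describe_path_fresh(s3, full_URL_stage))
```

If the two reports differ on a failing run, a stale listing is a plausible
cause; if both say the path is not a file, the cause more likely lies in how
the object appears in S3 at that moment.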



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
