[jira] [Commented] (ARROW-7244) [Python] Inconsistent behavior with reading in S3 parquet objects

Harini Kannan (Jira) Fri, 24 Apr 2020 12:25:40 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091841#comment-17091841
 ]


Harini Kannan commented on ARROW-7244:
--------------------------------------

Is there any update on this? I'm seeing the same error intermittently in a 
Lambda function that triggers on new Parquet files in an S3 bucket and reads 
them using ParquetDataset(). Is there a workaround?

> [Python] Inconsistent behavior with reading in S3 parquet objects
> -----------------------------------------------------------------
>
>                 Key: ARROW-7244
>                 URL: https://issues.apache.org/jira/browse/ARROW-7244
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: running in a lambda, compiled on an EC2 using linux
>            Reporter: William Tardio
>            Priority: Major
>
> We are piloting pyarrow for reading Parquet files from AWS S3.
>  
> We got it working in combination with s3fs as the filesystem. However, we are 
> seeing very inconsistent results when reading in Parquet objects with:
> s3 = s3fs.S3FileSystem()
> ParquetDataset(url, filesystem=s3)
>  
> The read inconsistently throws this error:
>  
> [ERROR] OSError: Passed non-file path: s3://bucket/schedule/sxaup/fms_db_aub/adn_master/trunc/20191122024436.parquet
> Traceback (most recent call last):
>   File "/var/task/file_check.py", line 35, in lambda_handler
>     main(event, context)
>   File "/var/task/file_check.py", line 260, in main
>     validate_resp['object_type'])
>   File "/opt/python/utils.py", line 80, in schema_check
>     stage_pya_dataset = ParquetDataset(full_URL_stage, filesystem=s3)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1030, in __init__
>     open_file_func=partial(_open_dataset_file, self._metadata)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1229, in _make_manifest
>     .format(path))
>  
> As you can see, the path is valid; the read sometimes works and other times 
> does not (the file is not modified between the successful and failing runs). 
> Does ParquetDataset actually open the file and validate it, so that the error 
> relates to the data?
>  
> Willing to do any troubleshooting to get this solved.
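
A possible stopgap, not confirmed anywhere in this thread, is to retry the
ParquetDataset() call when it raises OSError. The sketch below assumes the
failure is transient (for example, a stale s3fs directory listing; that cause
is an assumption, not a diagnosis). The helper name retry_on_oserror and its
parameters are hypothetical; invalidate_cache() is the s3fs method that clears
cached listings.

```python
import time


def retry_on_oserror(open_dataset, before_retry=None, attempts=3, delay=0.5):
    """Call open_dataset(); on OSError, run before_retry() and try again.

    Hypothetical helper for illustration only. Re-raises the last OSError
    once all attempts have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return open_dataset()
        except OSError as exc:
            last_error = exc
            if attempt + 1 < attempts:
                # Let the caller reset state (e.g. drop cached listings)
                # before the next attempt.
                if before_retry is not None:
                    before_retry()
                time.sleep(delay)
    raise last_error


# Intended use with the call from the report (requires pyarrow and s3fs):
#
#   import s3fs
#   from pyarrow.parquet import ParquetDataset
#
#   s3 = s3fs.S3FileSystem()
#   dataset = retry_on_oserror(
#       lambda: ParquetDataset(url, filesystem=s3),
#       before_retry=s3.invalidate_cache,  # assumption: stale listing cache
#   )
```

If the error persists across every retry, the transient-failure assumption is
probably wrong and the path handling itself needs investigation.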



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
