[ https://issues.apache.org/jira/browse/ARROW-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980485#comment-16980485 ]

William Tardio commented on ARROW-7244:
---------------------------------------

I have narrowed down when it fails and when it doesn't, but that doesn't 
reveal why it is happening.

 

When a lambda cold-starts and loads pyarrow and all its dependencies into the 
root python directory, it works 100% of the time. If the lambda spins down and 
I run it again (waiting long enough to get another cold start), it again works 
100% of the time.

 

If I do multiple executions back to back, so that it reuses the same "hot" 
environment, it works for the first execution (immediately after the cold 
start) but then fails every time after that as new parquet files come in. 
Very strange.
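One thing worth ruling out (a guess, not confirmed from the report): module-level state that survives warm starts. Anything created at import time, such as an s3fs.S3FileSystem and its internal directory-listing cache, is reused by back-to-back invocations, while a cold start rebuilds it, which would match the "first execution works, later warm executions miss new files" pattern. A minimal simulation of that failure mode, where FakeFS and the paths are stand-ins (only the invalidate_cache() name mirrors the real s3fs.S3FileSystem API):

```python
# Simulates how a module-level s3fs-style listing cache goes stale across
# warm Lambda invocations. FakeFS is illustrative, not s3fs itself.

class FakeFS:
    """Stand-in for s3fs.S3FileSystem: caches directory listings."""
    def __init__(self, bucket_contents):
        self.bucket = bucket_contents   # the "live" S3 state
        self.dircache = {}              # per-instance listing cache

    def ls(self, path):
        # Lists S3 only on a cache miss; later calls reuse the cache.
        if path not in self.dircache:
            self.dircache[path] = list(self.bucket[path])
        return self.dircache[path]

    def invalidate_cache(self):
        self.dircache.clear()

live = {"bucket/prefix": ["a.parquet"]}

# "Cold start": filesystem created once, at module import time.
fs = FakeFS(live)
first = fs.ls("bucket/prefix")      # ['a.parquet'] -- works

# A new object lands between invocations.
live["bucket/prefix"].append("b.parquet")

# "Warm" invocation reuses the stale cached listing and misses it...
second = fs.ls("bucket/prefix")     # still ['a.parquet']

# ...until the cache is invalidated (or a fresh filesystem is created).
fs.invalidate_cache()
third = fs.ls("bucket/prefix")      # ['a.parquet', 'b.parquet']
```

If this is the cause, calling s3.invalidate_cache() (or constructing a new S3FileSystem) inside the handler before ParquetDataset, rather than at module scope, would be a cheap experiment to confirm it.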

> Inconsistent behavior with reading in S3 parquet objects
> --------------------------------------------------------
>
>                 Key: ARROW-7244
>                 URL: https://issues.apache.org/jira/browse/ARROW-7244
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: running in a lambda, compiled on an EC2 using linux
>            Reporter: William Tardio
>            Priority: Blocker
>
> We are piloting pyarrow to read parquet files from AWS S3.
>  
> We got it working in combination with s3fs as the filesystem. However, we are 
> seeing very inconsistent results when reading parquet objects with:
> s3=s3fs.S3FileSystem()
> ParquetDataset(url, filesystem=s3)
>  
> The read inconsistently throws this error:
>  
> [ERROR] OSError: Passed non-file path: 
> s3://bucket/schedule/sxaup/fms_db_aub/adn_master/trunc/20191122024436.parquet
> Traceback (most recent call last):
>   File "/var/task/file_check.py", line 35, in lambda_handler
>     main(event, context)
>   File "/var/task/file_check.py", line 260, in main
>     validate_resp['object_type'])
>   File "/opt/python/utils.py", line 80, in schema_check
>     stage_pya_dataset = ParquetDataset(full_URL_stage, filesystem=s3)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1030, in __init__
>     open_file_func=partial(_open_dataset_file, self._metadata)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1229, in _make_manifest
>     .format(path))
>  
> As you can see, the path is valid; it sometimes works and other times does 
> not (with no modification of the file between the successful and failing 
> runs). Does ParquetDataset actually open and validate the file, so that the 
> error refers to the data?
>  
> Willing to do any troubleshooting to get this solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
