[jira] [Commented] (ARROW-1213) [Python] Enable s3fs to be used with ParquetDataset and reader/writer functions

ASF GitHub Bot (JIRA) Wed, 18 Oct 2017 07:56:50 -0700

    [ 
https://issues.apache.org/jira/browse/ARROW-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209473#comment-16209473
 ]


ASF GitHub Bot commented on ARROW-1213:
---------------------------------------

Github user DrChrisLevy commented on the issue:

    https://github.com/apache/arrow/pull/916
  
    Thanks @wesm!
    I figured it out by looking through the commit changes. If anyone comes
    across this thread, here is how you can read Parquet files from an S3
    directory using pyarrow.
    
    **Make sure you have the packages:**
    
    `pip install pyarrow`
    `pip install s3fs`
    
    **Python Code:**
    
    ```
    import s3fs
    import pyarrow.parquet as pq

    access_key = '<aws_access_key_id>'  # replace with your AWS access key ID
    secret_key = '<aws_secret_access_key>'  # replace with your AWS secret access key
    fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)

    # Suppose you had some parquet files stored in the
    # s3 path: s3://my_bucket/my_data/my_favorite_data
    bucket = 'my_bucket'
    path = 'my_data/my_favorite_data'
    bucket_uri = 's3://{bucket}/{path}'.format(bucket=bucket, path=path)
    # Passing the s3fs filesystem lets pyarrow find and open the files under the path
    dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
    table = dataset.read()  # pyarrow Table
    df = table.to_pandas()  # pandas DataFrame
    ```
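
The same steps can be wrapped in a small helper. This is only a sketch: the names `s3_uri` and `read_parquet_dir` are hypothetical and not part of pyarrow or s3fs. It assumes that, when no `key`/`secret` is passed, s3fs picks up credentials from the usual AWS sources (environment variables or `~/.aws/credentials`), which avoids hardcoding keys in the script.

```python
def s3_uri(bucket, path):
    """Join a bucket name and a key prefix into an s3:// URI (hypothetical helper)."""
    return 's3://{}/{}'.format(bucket.strip('/'), path.strip('/'))


def read_parquet_dir(bucket, path, columns=None, **s3_kwargs):
    """Read every Parquet file under s3://bucket/path into a pandas DataFrame.

    s3_kwargs are forwarded to s3fs.S3FileSystem (for example key=..., secret=...).
    """
    # Imported here so that s3_uri can be used without these packages installed.
    import s3fs
    import pyarrow.parquet as pq

    fs = s3fs.S3FileSystem(**s3_kwargs)
    dataset = pq.ParquetDataset(s3_uri(bucket, path), filesystem=fs)
    # columns=None reads all columns; pass a list of names to read a subset.
    return dataset.read(columns=columns).to_pandas()
```

With the example path from the comment above, the call would be `df = read_parquet_dir('my_bucket', 'my_data/my_favorite_data')`.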


> [Python] Enable s3fs to be used with ParquetDataset and reader/writer 
> functions
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-1213
>                 URL: https://issues.apache.org/jira/browse/ARROW-1213
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Yacko
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>
> The pyarrow dataset function (ParquetDataset) can't read from S3 when s3fs
> is used as the filesystem. Is there a way to add support for reading
> partitioned files from S3?
> I am trying to address the problem described in this Stack Overflow question:
> https://stackoverflow.com/questions/45082832/how-to-read-partitioned-parquet-files-from-s3-using-pyarrow-in-python



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
