[
https://issues.apache.org/jira/browse/ARROW-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209473#comment-16209473
]
ASF GitHub Bot commented on ARROW-1213:
---------------------------------------
Github user DrChrisLevy commented on the issue:
https://github.com/apache/arrow/pull/916
Thanks @wesm !
I figured it out by looking through the commit changes. If anyone comes
across this thread, here is how you can read Parquet files from an S3
directory using pyarrow.
**Make sure you have the packages:**
`pip install pyarrow`
`pip install s3fs`
**Python Code:**
```
import s3fs
import pyarrow.parquet as pq

# Replace the placeholders below with your own credentials.
access_key = '<aws_access_key_id>'  # string with your aws_access_key_id
secret_key = '<aws_secret_access_key>'  # string with your aws_secret_access_key
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)

# Suppose you had some parquet files stored in the
# s3 path: s3://my_bucket/my_data/my_favorite_data
bucket = 'my_bucket'
path = 'my_data/my_favorite_data'
bucket_uri = 's3://{bucket}/{path}'.format(bucket=bucket, path=path)

# Pass the s3fs filesystem so ParquetDataset lists and reads the files on S3.
dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
table = dataset.read()
df = table.to_pandas()
```
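If you need this in more than one place, the steps above could be wrapped in a small helper. This is only a sketch under a few assumptions: the names `build_s3_uri` and `read_parquet_from_s3` are made up for illustration, the `columns` argument is simply forwarded to `dataset.read()`, and leaving `key` and `secret` as `None` relies on s3fs falling back to your default AWS credentials. Exact behavior may differ between pyarrow and s3fs versions.

```python
def build_s3_uri(bucket, path):
    """Join a bucket name and a key prefix into an s3:// URI (hypothetical helper)."""
    return 's3://{}/{}'.format(bucket.strip('/'), path.strip('/'))


def read_parquet_from_s3(bucket, path, columns=None, key=None, secret=None):
    """Read a directory of parquet files on S3 into a pandas DataFrame.

    Hypothetical wrapper around the snippet above, not part of pyarrow.
    """
    # Imported lazily so the URI helper can be used without these packages.
    import s3fs
    import pyarrow.parquet as pq

    # With key/secret left as None, s3fs looks up credentials on its own
    # (environment variables, ~/.aws config, and so on).
    fs = s3fs.S3FileSystem(key=key, secret=secret)
    dataset = pq.ParquetDataset(build_s3_uri(bucket, path), filesystem=fs)
    # columns=None reads every column; a list of names reads only those.
    return dataset.read(columns=columns).to_pandas()


# Example call (needs real credentials and data, so it is left commented out;
# 'col_a' is a placeholder column name):
# df = read_parquet_from_s3('my_bucket', 'my_data/my_favorite_data', columns=['col_a'])
```

Keeping the URI construction in its own function makes it easy to check without touching S3 at all.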
> [Python] Enable s3fs to be used with ParquetDataset and reader/writer
> functions
> -------------------------------------------------------------------------------
>
> Key: ARROW-1213
> URL: https://issues.apache.org/jira/browse/ARROW-1213
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Yacko
> Assignee: Wes McKinney
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The pyarrow dataset function can't read from S3 when s3fs is used as the
> filesystem. Is there a way to add support for reading partitioned files
> from S3?
> I am trying to address the problem described in this Stack Overflow question:
> https://stackoverflow.com/questions/45082832/how-to-read-partitioned-parquet-files-from-s3-using-pyarrow-in-python