[https://issues.apache.org/jira/browse/ARROW-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927572#comment-16927572]
david cottrell commented on ARROW-5156:
---------------------------------------
I'm hitting this and it seems there is something to do with how the paths and
filesystems get resolved (for example in `_get_filesystem_and_path` in
pyarrow). I'm not quite clear on what is intended there across all the
different filesystem types.
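For concreteness, the kind of resolution I mean looks roughly like the sketch
below (hypothetical names; pyarrow's actual `_get_filesystem_and_path` differs
in detail):
{code:python}
from urllib.parse import urlparse


def resolve_filesystem_and_path(uri, filesystem=None):
    # Hypothetical sketch of scheme-based (fs, path) resolution; not
    # pyarrow's actual implementation.
    if filesystem is not None:
        # Caller supplied a filesystem explicitly: trust it.
        return filesystem, uri
    parsed = urlparse(uri)
    if parsed.scheme in ('s3', 's3a', 's3n'):
        import s3fs
        # s3fs addresses objects as 'bucket/key', without the scheme.
        return s3fs.S3FileSystem(), parsed.netloc + parsed.path
    # No recognised scheme: fall back to the local filesystem.
    from pyarrow.filesystem import LocalFileSystem
    return LocalFileSystem(), uri
{code}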
From the pandas side, the following seems to work (write_to_dataset is being
called in the code path corresponding to partition_cols not being None).
{noformat}
In [122]: api
Out[122]: <pandas.io.parquet.PyArrowImpl at 0x7f3c6cfa6470>

In [121]: api.api.parquet.write_to_dataset(table, path + '/more',
     ...:     partition_cols=partition_cols,
     ...:     filesystem=s3fs.S3FileSystem())  # works properly for me

In [120]: api.api.parquet.write_to_dataset(table, path + '/more',
     ...:     partition_cols=partition_cols)  # silently fails (no file written), or maybe I need to flush
{noformat}
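In other words, passing the filesystem explicitly sidesteps the broken
resolution. A minimal standalone version of the working call (reusing the
bucket name from the report):
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

df = pd.DataFrame([{'a': 1, 'b': 2}])
table = pa.Table.from_pandas(df)
# Supplying filesystem= explicitly means pyarrow never has to guess the
# filesystem from the URI, which is the step that currently resolves to None.
pq.write_to_dataset(table, 's3://my_s3_bucket/x2.parquet',
                    partition_cols=['a'],
                    filesystem=s3fs.S3FileSystem())
{code}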
At this point the minimal change seems to be on the pandas side (I will try to
post a branch soon), but I suspect that is not the *correct* fix. It is likely
something to do with the chain of (path, fs) / (fs, path) resolvers in
pyarrow/parquet.py and pyarrow/filesystem.py.
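For illustration, the kind of defensive fix I have in mind in
pyarrow/parquet.py (a sketch only, not a tested patch):
{code:python}
def _mkdir_if_not_exists(fs, path):
    # Sketch: fail loudly if the (fs, path) resolution upstream handed us
    # fs=None, instead of crashing on fs._isfilestore().
    if fs is None:
        raise ValueError('no filesystem was resolved for {!r}; '
                         'pass filesystem= explicitly'.format(path))
    if fs._isfilestore() and not fs.exists(path):
        fs.mkdir(path)
{code}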
> [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with
> `'NoneType' object has no attribute '_isfilestore'`
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-5156
> URL: https://issues.apache.org/jira/browse/ARROW-5156
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.12.1
> Environment: Mac, Linux
> Reporter: Victor Shih
> Priority: Major
> Labels: parquet
> Fix For: 1.0.0
>
>
> According to
> [https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#partitioning-parquet-files],
> writing a parquet to S3 with `partition_cols` should work, but it fails for
> me. Example script:
> {code:python}
> import pandas as pd
> import sys
> print(sys.version)
> print(pd.__version__)
> df = pd.DataFrame([{'a': 1, 'b': 2}])
> df.to_parquet('s3://my_s3_bucket/x.parquet', engine='pyarrow')
> print('OK 1')
> df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], engine='pyarrow')
> print('OK 2')
> {code}
> Output:
> {noformat}
> 3.5.2 (default, Feb 14 2019, 01:46:27)
> [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)]
> 0.24.2
> OK 1
> Traceback (most recent call last):
>   File "./t.py", line 14, in <module>
>     df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], engine='pyarrow')
>   File "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/core/frame.py", line 2203, in to_parquet
>     partition_cols=partition_cols, **kwargs)
>   File "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", line 252, in to_parquet
>     partition_cols=partition_cols, **kwargs)
>   File "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", line 118, in write
>     partition_cols=partition_cols, **kwargs)
>   File "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", line 1227, in write_to_dataset
>     _mkdir_if_not_exists(fs, root_path)
>   File "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", line 1182, in _mkdir_if_not_exists
>     if fs._isfilestore() and not fs.exists(path):
> AttributeError: 'NoneType' object has no attribute '_isfilestore'
> {noformat}
>
> Original issue - [https://github.com/apache/arrow/issues/4030]