Lance Dacey created ARROW-9514:
----------------------------------
Summary: The new Dataset API will not work with files on Azure
Blob (pq.read_table() does work and so does Dask)
Key: ARROW-9514
URL: https://issues.apache.org/jira/browse/ARROW-9514
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.17.1
Environment: Ubuntu 18.04
Reporter: Lance Dacey
Fix For: 0.17.1
I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False),
and in both cases my connection to Azure Blob fails.
I know the documentation says only HDFS and S3 are implemented, but I have been
using Azure Blob by passing an fsspec filesystem when reading and writing
parquet files/datasets with pyarrow (with use_legacy_dataset=True). Dask also
works against Azure Blob via storage_options.
I am hoping that Azure Blob will be supported because I'd really like to try
out the new row filtering and non-hive partitioning schemes.
This is what I use for the filesystem when using read_table() or
write_to_dataset():

import fsspec

fs = fsspec.filesystem(
    protocol='abfs',
    account_name=base.login,
    account_key=base.password,
)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)