[
https://issues.apache.org/jira/browse/ARROW-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lance Dacey updated ARROW-9514:
-------------------------------
Description:
I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False),
and my connection to Azure Blob fails.
I know the documentation says only HDFS and S3 are implemented, but I have been
using Azure Blob by passing an fsspec filesystem when reading and writing
Parquet files/datasets with pyarrow (with use_legacy_dataset=True). Dask also
works with storage_options.
I am hoping that Azure Blob will be supported because I'd really like to try
out the new row filtering and non-hive partitioning schemes.
This is what I use for the filesystem when using read_table() or
write_to_dataset():

fs = fsspec.filesystem(protocol='abfs',
                       account_name=base.login,
                       account_key=base.password)
It seems like the class _ParquetDatasetV2 has a section that the original
ParquetDataset does not have. Perhaps this is why the fsspec filesystem fails
when I turn off the legacy dataset?
Line 1423 in arrow/python/pyarrow/parquet.py:

if filesystem is not None:
    filesystem = pyarrow.fs._ensure_filesystem(filesystem, use_mmap=memory_map)
was:
I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False),
and my connection to Azure Blob fails.
I know the documentation says only HDFS and S3 are implemented, but I have been
using Azure Blob by passing an fsspec filesystem when reading and writing
Parquet files/datasets with pyarrow (with use_legacy_dataset=True). Dask also
works with storage_options.
I am hoping that Azure Blob will be supported because I'd really like to try
out the new row filtering and non-hive partitioning schemes.
This is what I use for the filesystem when using read_table() or
write_to_dataset():

fs = fsspec.filesystem(protocol='abfs',
                       account_name=base.login,
                       account_key=base.password)
> The new Dataset API will not work with files on Azure Blob (pq.read_table()
> does work and so does Dask)
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9514
> URL: https://issues.apache.org/jira/browse/ARROW-9514
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.1
> Environment: Ubuntu 18.04
> Reporter: Lance Dacey
> Priority: Major
> Labels: cloud, dataset, parquet
> Fix For: 0.17.1
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)