[
https://issues.apache.org/jira/browse/ARROW-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lance Dacey updated ARROW-9514:
-------------------------------
Description:
I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False),
and my connection to Azure Blob fails.
I know the documentation says only HDFS and S3 are implemented, but I have been
using Azure Blob by passing an fsspec filesystem when reading and writing
Parquet files/datasets with pyarrow (with use_legacy_dataset=True). Dask also
works with storage_options.
I am hoping that Azure Blob will be supported because I'd really like to try
out the new row filtering and non-hive partitioning schemes.
This is what I use for the filesystem when using read_table() or
write_to_dataset():

fs = fsspec.filesystem(protocol='abfs',
                       account_name=base.login,
                       account_key=base.password)
It seems like the class _ParquetDatasetV2 has a section that the original
ParquetDataset does not have. Perhaps this is why the fsspec filesystem fails
when I turn off the legacy dataset?
Line 1423 in arrow/python/pyarrow/parquet.py:

if filesystem is not None:
    filesystem = pyarrow.fs._ensure_filesystem(filesystem, use_mmap=memory_map)
was:
I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False),
and my connection to Azure Blob fails.
I know the documentation says only HDFS and S3 are implemented, but I have been
using Azure Blob by passing an fsspec filesystem when reading and writing
Parquet files/datasets with pyarrow (with use_legacy_dataset=True). Dask also
works with storage_options.
I am hoping that Azure Blob will be supported because I'd really like to try
out the new row filtering and non-hive partitioning schemes.
This is what I use for the filesystem when using read_table() or
write_to_dataset():

fs = fsspec.filesystem(protocol='abfs',
                       account_name=base.login,
                       account_key=base.password)
> The new Dataset API will not work with files on Azure Blob (pq.read_table()
> does work and so does Dask)
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9514
> URL: https://issues.apache.org/jira/browse/ARROW-9514
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.1
> Environment: Ubuntu 18.04
> Reporter: Lance Dacey
> Priority: Major
> Labels: cloud, dataset, parquet
> Fix For: 0.17.1
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)