[ https://issues.apache.org/jira/browse/ARROW-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-9514:
-----------------------------------
    Fix Version/s:     (was: 0.17.1)
                       (was: 1.0.0)
                   2.0.0

> [Python] The new Dataset API will not work with files on Azure Blob
> -------------------------------------------------------------------
>
>                 Key: ARROW-9514
>                 URL: https://issues.apache.org/jira/browse/ARROW-9514
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 1.0.0, 0.17.1
>         Environment: Ubuntu 18.04
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: cloud, dataset, parquet
>             Fix For: 2.0.0
>
>
> I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False) 
> and my connection to Azure Blob fails. 
>  
> I know the documentation says only HDFS and S3 are implemented, but I have 
> been using Azure Blob by passing an fsspec filesystem when reading and 
> writing parquet files/datasets with PyArrow (with use_legacy_dataset=True). 
> Dask also works with storage_options.
> I am hoping that Azure Blob will be supported because I'd really like to try 
> out the new row filtering and non-hive partitioning schemes.
> This is what I use for the filesystem when using read_table() or 
> write_to_dataset():
>  
> fs = fsspec.filesystem(protocol='abfs',
>                        account_name=base.login,
>                        account_key=base.password)
>  
>  
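> For reference, that fs object is then passed straight through on the legacy 
> code path, roughly like this (the paths and partition column below are just 
> placeholders):
>
> import pyarrow.parquet as pq
>
> table = pq.read_table('analytics/test/test.parquet', filesystem=fs)
> pq.write_to_dataset(table, 'analytics/test/dataset',
>                     partition_cols=['date'],  # placeholder partition column
>                     filesystem=fs)
>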
> It seems like the class _ParquetDatasetV2 has a section that the original 
> ParquetDataset does not have. Perhaps this is why the fsspec filesystem fails 
> when I turn off the legacy dataset code path?
>  
> Line 1423 in arrow/python/pyarrow/parquet.py:
>  
> if filesystem is not None:
>     filesystem = pyarrow.fs._ensure_filesystem(filesystem,
>                                                use_mmap=memory_map)
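>
> If I follow it correctly, _ensure_filesystem ends up wrapping the fsspec 
> filesystem in the new pyarrow.fs interface (the traceback below goes through 
> FSSpecHandler). The same wrapping can be done by hand, something along these 
> lines (the path is a placeholder):
>
> from pyarrow.fs import PyFileSystem, FSSpecHandler
>
> # wrap the fsspec AzureBlobFileSystem in the new pyarrow filesystem API
> pa_fs = PyFileSystem(FSSpecHandler(fs))
> print(pa_fs.get_file_info(['analytics/test/test.parquet']))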
> EDIT -
> I got this to work using fsspec on *single* files on Azure Blob:
> import pyarrow.dataset as ds
> import fsspec
>
> fs = fsspec.filesystem(protocol='abfs',
>                        account_name=login,
>                        account_key=password)
> dataset = ds.dataset("abfs://analytics/test/test.parquet", format="parquet",
>                      filesystem=fs)
> dataset.to_table(
>     columns=['ticket_id', 'event_value'],
>     filter=ds.field('event_value') == 'closed'
> ).to_pandas().drop_duplicates('ticket_id')
> When I try this on a partitioned dataset I created with write_to_dataset, 
> however, I run into an error. I tried the same code as above, and also with 
> the partitioning='hive' option.
> TypeError                                 Traceback (most recent call last)
> <ipython-input-174-f44e707aa83e> in <module>
> ----> 1 dataset = ds.dataset("abfs://analytics/test/tickets-audits/", format="parquet", filesystem=fs, partitioning="hive", )
>
> ~/.local/lib/python3.7/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
>     665     # TODO(kszucs): support InMemoryDataset for a table input
>     666     if _is_path_like(source):
> --> 667         return _filesystem_dataset(source, **kwargs)
>     668     elif isinstance(source, (tuple, list)):
>     669         if all(_is_path_like(elem) for elem in source):
>
> ~/.local/lib/python3.7/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
>     430         selector_ignore_prefixes=selector_ignore_prefixes
>     431     )
> --> 432     factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
>     433 
>     434     return factory.finish(schema)
>
> ~/.local/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.FileSystemDatasetFactory.__init__()
>
> ~/.local/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
>
> ~/.local/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_get_file_info_selector()
>
> ~/.local/lib/python3.7/site-packages/pyarrow/fs.py in get_file_info_selector(self, selector)
>     159         infos = []
>     160         selected_files = self.fs.find(
> --> 161             selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True
>     162         )
>     163         for path, info in selected_files.items():
>
> /opt/conda/lib/python3.7/site-packages/fsspec/spec.py in find(self, path, maxdepth, withdirs, **kwargs)
>     369         # TODO: allow equivalent of -name parameter
>     370         out = set()
> --> 371         for path, dirs, files in self.walk(path, maxdepth, **kwargs):
>     372             if withdirs:
>     373                 files += dirs
>
> /opt/conda/lib/python3.7/site-packages/fsspec/spec.py in walk(self, path, maxdepth, **kwargs)
>     324 
>     325         try:
> --> 326             listing = self.ls(path, detail=True, **kwargs)
>     327         except (FileNotFoundError, IOError):
>     328             return [], [], []
>
> TypeError: ls() got multiple values for keyword argument 'detail'
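>
> A possible workaround (just a sketch, I have not verified it end to end) 
> might be to skip the directory discovery entirely by listing the data files 
> with fsspec and passing ds.dataset an explicit list of paths:
>
> # list the parquet files myself so the selector-based find(detail=True)
> # code path in FSSpecHandler is never hit
> paths = [p for p in fs.find('analytics/test/tickets-audits/')
>          if p.endswith('.parquet')]
> dataset = ds.dataset(paths, format='parquet', filesystem=fs,
>                      partitioning='hive',
>                      partition_base_dir='analytics/test/tickets-audits/')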
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
