[ https://issues.apache.org/jira/browse/ARROW-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-9514:
-----------------------------------
    Fix Version/s:     (was: 0.17.1)
                       (was: 1.0.0)
                       2.0.0

> [Python] The new Dataset API will not work with files on Azure Blob
> -------------------------------------------------------------------
>
>                 Key: ARROW-9514
>                 URL: https://issues.apache.org/jira/browse/ARROW-9514
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 1.0.0, 0.17.1
>         Environment: Ubuntu 18.04
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: cloud, dataset, parquet
>             Fix For: 2.0.0
>
>
> I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False), and my connection to Azure Blob fails.
>
> I know the documentation says only HDFS and S3 are implemented, but I have been using Azure Blob by passing an fsspec filesystem when reading and writing Parquet files/datasets with pyarrow (with use_legacy_dataset=True). Dask also works with storage_options.
>
> I am hoping that Azure Blob will be supported, because I'd really like to try out the new row filtering and non-Hive partitioning schemes.
>
> This is the filesystem I use with read_table() and write_to_dataset():
>
> fs = fsspec.filesystem(protocol='abfs',
>                        account_name=base.login,
>                        account_key=base.password)
>
> It seems like the class _ParquetDatasetV2 has a section that the original ParquetDataset does not have. Perhaps this is why the fsspec filesystem fails when I turn off the legacy system?
>
> Line 1423 in arrow/python/pyarrow/parquet.py:
>
> if filesystem is not None:
>     filesystem = pyarrow.fs._ensure_filesystem(filesystem,
>                                                use_mmap=memory_map)
>
> EDIT -
> I got this to work using fsspec on *single* files on Azure Blob:
>
> import pyarrow.dataset as ds
> import fsspec
>
> fs = fsspec.filesystem(protocol='abfs',
>                        account_name=login,
>                        account_key=password)
>
> dataset = ds.dataset("abfs://analytics/test/test.parquet", format="parquet", filesystem=fs)
> dataset.to_table(
>     columns=['ticket_id', 'event_value'],
>     filter=ds.field('event_value') == 'closed'
> ).to_pandas().drop_duplicates('ticket_id')
>
> When I try the same thing on a partitioned dataset I created with write_to_dataset, I run into the error shown in the traceback below. I tried the same code as above and also the partitioning='hive' option.
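
For reference, the sketch below shows one way an fsspec/adlfs filesystem could be handed to the new Dataset API by wrapping it explicitly, assuming a pyarrow version that provides pyarrow.fs.PyFileSystem and FSSpecHandler (around the 2.0.0 fix version noted above). The account credentials and container path are placeholders; this is an illustration under those assumptions, not the reporter's code or a confirmed fix for this issue.

    # Sketch only: adapt an fsspec (adlfs) filesystem for pyarrow.dataset by
    # wrapping it in PyFileSystem(FSSpecHandler(...)). Assumes pyarrow with
    # pyarrow.fs.FSSpecHandler and adlfs installed; names are placeholders.
    import fsspec
    import pyarrow.dataset as ds
    from pyarrow.fs import PyFileSystem, FSSpecHandler

    abfs = fsspec.filesystem(
        "abfs",
        account_name="<storage-account>",   # placeholder
        account_key="<storage-key>",        # placeholder
    )

    # Wrap the fsspec filesystem so the Dataset API sees a pyarrow FileSystem.
    filesystem = PyFileSystem(FSSpecHandler(abfs))

    dataset = ds.dataset(
        "analytics/test/tickets-audits",    # placeholder container/path
        format="parquet",
        partitioning="hive",
        filesystem=filesystem,
    )
    table = dataset.to_table(filter=ds.field("event_value") == "closed")
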
> TypeError                                 Traceback (most recent call last)
> <ipython-input-174-f44e707aa83e> in <module>
> ----> 1 dataset = ds.dataset("abfs://analytics/test/tickets-audits/", format="parquet", filesystem=fs, partitioning="hive", )
>
> ~/.local/lib/python3.7/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
>     665     # TODO(kszucs): support InMemoryDataset for a table input
>     666     if _is_path_like(source):
> --> 667         return _filesystem_dataset(source, **kwargs)
>     668     elif isinstance(source, (tuple, list)):
>     669         if all(_is_path_like(elem) for elem in source):
>
> ~/.local/lib/python3.7/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
>     430         selector_ignore_prefixes=selector_ignore_prefixes
>     431     )
> --> 432     factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
>     433
>     434     return factory.finish(schema)
>
> ~/.local/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.FileSystemDatasetFactory.__init__()
>
> ~/.local/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
>
> ~/.local/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_get_file_info_selector()
>
> ~/.local/lib/python3.7/site-packages/pyarrow/fs.py in get_file_info_selector(self, selector)
>     159         infos = []
>     160         selected_files = self.fs.find(
> --> 161             selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True
>     162         )
>     163         for path, info in selected_files.items():
>
> /opt/conda/lib/python3.7/site-packages/fsspec/spec.py in find(self, path, maxdepth, withdirs, **kwargs)
>     369         # TODO: allow equivalent of -name parameter
>     370         out = set()
> --> 371         for path, dirs, files in self.walk(path, maxdepth, **kwargs):
>     372             if withdirs:
>     373                 files += dirs
>
> /opt/conda/lib/python3.7/site-packages/fsspec/spec.py in walk(self, path, maxdepth, **kwargs)
>     324
>     325         try:
> --> 326             listing = self.ls(path, detail=True, **kwargs)
>     327         except (FileNotFoundError, IOError):
>     328             return [], [], []
>
> TypeError: ls() got multiple values for keyword argument 'detail'
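
To make the final TypeError easier to follow: in the frames above, pyarrow's get_file_info_selector passes detail=True into fsspec's find(), find() forwards it inside **kwargs to walk(), and walk() then also passes detail=True explicitly when calling ls(), so ls() receives the keyword twice. A minimal sketch of that pattern with stand-in functions (illustration only, not the actual fsspec/adlfs code):

    # Minimal sketch of the failure pattern seen in the traceback above.
    # Illustration only -- stand-in functions, not fsspec/adlfs code.
    def ls(path, detail=True, **kwargs):
        return []

    def walk(path, maxdepth=None, **kwargs):
        # 'detail' arrived here via **kwargs and is also passed explicitly
        # below, so ls() receives the keyword twice.
        return ls(path, detail=True, **kwargs)

    def find(path, maxdepth=None, withdirs=False, **kwargs):
        # detail=True lands in **kwargs and is forwarded unchanged.
        return walk(path, maxdepth, **kwargs)

    find("analytics/test/tickets-audits", withdirs=True, detail=True)
    # TypeError: ls() got multiple values for keyword argument 'detail'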