Lance Dacey created ARROW-10517:
-----------------------------------

             Summary: [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
                 Key: ARROW-10517
                 URL: https://issues.apache.org/jira/browse/ARROW-10517
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
         Environment: Ubuntu 18.04
            Reporter: Lance Dacey
If I downgrade adlfs to 0.2.5 and azure-storage-blob to 2.1, and then upgrade fsspec (0.6.2 has errors with the detail kwarg, so I need to upgrade it), writing a dataset fails:

{code:java}
pa.dataset.write_dataset(
    data=table,
    base_dir="test/test7",
    basename_template=None,
    format="parquet",
    partitioning=DirectoryPartitioning(
        pa.schema([("year", pa.int64()), ("month", pa.int16()), ("day", pa.int16())])
    ),
    schema=table.schema,
    filesystem=blob_fs,
)
{code}

{code:java}
    226     def create_dir(self, path, recursive):
    227         # mkdir also raises FileNotFoundError when base directory is not found
--> 228         self.fs.mkdir(path, create_parents=recursive)
{code}

It does not look like this version of adlfs implements mkdir. However, fs.find() returns a dictionary as expected:

{code:java}
selected_files = blob_fs.find(
    "test/test6", maxdepth=None, withdirs=True, detail=True
)
{code}

Now if I install the latest version of adlfs, it upgrades my blob SDK to 12.5 (unfortunately, I cannot use this in production since Airflow requires 2.1, so this is only for testing purposes):

{code:java}
Successfully installed adlfs-0.5.5 azure-storage-blob-12.5.0
{code}

With these versions, fs.mkdir() works, but fs.find() now returns a list.
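For testing, the two incompatibilities (no mkdir on old adlfs, list-returning find() on new adlfs) can be papered over on the client side before handing the filesystem to pyarrow. This is only a hypothetical sketch, not part of adlfs or pyarrow — NormalizedFS and its behavior are my own invention, and the real fix would belong in one of those libraries:

```python
class NormalizedFS:
    """Hypothetical shim around an fsspec-style filesystem so that:

    - mkdir always exists (a no-op is generally safe on blob stores,
      where directories are virtual), and
    - find(..., detail=True) always returns a {path: info} dict, which
      is the shape pyarrow's FSSpecHandler iterates with .items().
    """

    def __init__(self, fs):
        self._fs = fs

    def __getattr__(self, name):
        # Forward everything else (open, info, ls, ...) to the wrapped fs.
        return getattr(self._fs, name)

    def mkdir(self, path, create_parents=True, **kwargs):
        inner = getattr(self._fs, "mkdir", None)
        if inner is not None:
            inner(path, create_parents=create_parents, **kwargs)
        # else: no real directories on blob storage, nothing to do

    def find(self, path, **kwargs):
        out = self._fs.find(path, **kwargs)
        if kwargs.get("detail") and isinstance(out, list):
            # Newer adlfs returns a plain list of paths here;
            # rebuild the detail dict with one info() call per entry.
            out = {p: self._fs.info(p) for p in out}
        return out
```

Wrapping blob_fs as NormalizedFS(blob_fs) might get past the mkdir and .items() errors shown above, at the cost of extra info() calls per file; I have not verified it against every adlfs version.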
{code:java}
['test/test6/year=2020',
 'test/test6/year=2020/month=11',
 'test/test6/year=2020/month=11/day=1',
 'test/test6/year=2020/month=11/day=1/8ee6c66320ca47908c37f112f0cffd6c.parquet',
 'test/test6/year=2020/month=11/day=1/ef753f016efc44b7b0f0800c35d084fc.parquet']
{code}

This causes issues later when I try to read a dataset (the code still expects a dictionary):

{code:java}
dataset = ds.dataset("test/test5", filesystem=blob_fs, format="parquet")
{code}

{code:java}
--> 221     for path, info in selected_files.items():
    222         infos.append(self._create_file_info(path, info))
    223

AttributeError: 'list' object has no attribute 'items'
{code}

I am still able to read individual files:

{code:java}
dataset = ds.dataset(
    "test/test4/year=2020/month=11/2020-11.parquet",
    filesystem=blob_fs,
    format="parquet",
)
{code}

And I can read the dataset if I pass in a list of blob names "manually":

{code:java}
blobs = wasb.list_blobs(container_name="test", prefix="test4")
dataset = ds.dataset(
    source=["test/" + blob.name for blob in blobs],
    format="parquet",
    partitioning="hive",
    filesystem=blob_fs,
)
{code}

For all of my examples, blob_fs is defined by:

{code:java}
blob_fs = fsspec.filesystem(
    protocol="abfs", account_name=base.login, account_key=base.password
)
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)