benhc opened a new issue, #36983: URL: https://github.com/apache/arrow/issues/36983
### Describe the bug, including details regarding any error messages, version, and platform.

I need to use s3fs as the filesystem in the `dataset` constructor due to the performance considerations raised in https://github.com/apache/arrow/issues/33169. However, when I try to do so I get the following stack trace:

```
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[14], line 1
----> 1 ds.dataset(
      2     "my_bucket/my_dataset_path",
      3     filesystem=fs,
      4     partitioning=None,
      5     format="parquet"
      6 )

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/dataset.py:763, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    752 kwargs = dict(
    753     schema=schema,
    754     filesystem=filesystem,
    (...)
    759     selector_ignore_prefixes=ignore_prefixes
    760 )
    762 if _is_path_like(source):
--> 763     return _filesystem_dataset(source, **kwargs)
    764 elif isinstance(source, (tuple, list)):
    765     if all(_is_path_like(elem) for elem in source):

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/dataset.py:456, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    448 options = FileSystemFactoryOptions(
    449     partitioning=partitioning,
    450     partition_base_dir=partition_base_dir,
    451     exclude_invalid_files=exclude_invalid_files,
    452     selector_ignore_prefixes=selector_ignore_prefixes
    453 )
    454 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 456 return factory.finish(schema)

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2752, in pyarrow._dataset.DatasetFactory.finish()

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/fs.py:424, in FSSpecHandler.open_input_file(self, path)
    421 from pyarrow import PythonFile
    423 if not self.fs.isfile(path):
--> 424     raise FileNotFoundError(path)
    426 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: my_bucket/my_dataset_path/
```

This is caused by an inconsistency in the result of `get_file_info` from the s3fs filesystem when wrapped in a `PyFileSystem`, versus the native pyarrow S3 filesystem.
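The last frame makes the mechanism visible: the dataset factory receives `'my_bucket/my_dataset_path/'` from `get_file_info` as a regular file entry, but `FSSpecHandler.open_input_file` then re-checks the path with `self.fs.isfile(path)`, which is `False` for the trailing-slash marker key, hence the `FileNotFoundError`. A minimal sketch of that check in isolation (assuming ambient AWS credentials and the bucket layout used throughout this report):

```
import s3fs

fs = s3fs.S3FileSystem()  # assumes credentials from the environment

# This is the same check FSSpecHandler.open_input_file performs before
# opening a path: for the zero-length directory-marker key it returns
# False, even though get_file_info listed the path as a FileType.File.
print(fs.isfile("my_bucket/my_dataset_path/"))  # expected: False
```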
Code to demonstrate:

```
from boto3 import Session
import s3fs
import pyarrow.dataset as ds
import pyarrow as pa
from pyarrow.fs import PyFileSystem, FSSpecHandler, FileSelector

session = Session()
credentials = session.get_credentials()

s3_fs = s3fs.S3FileSystem(
    anon=False,
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
)
pa_s3_fs = PyFileSystem(FSSpecHandler(s3_fs))

pa_fs = pa.fs.S3FileSystem(
    access_key=credentials.access_key,
    secret_key=credentials.secret_key,
    session_token=credentials.token,
)

selector = FileSelector("my_bucket/my_dataset_path/", recursive=True)
```

```
pa_s3_fs.get_file_info(selector)
>>> [<FileInfo for 'my_bucket/my_dataset_path/': type=FileType.File, size=0>,
     <FileInfo for 'my_bucket/my_dataset_path/part-0.parquet': type=FileType.File, size=1707552354>]
```

```
pa_fs.get_file_info(selector)
>>> [<FileInfo for 'my_bucket/my_dataset_path/part-0.parquet': type=FileType.File, size=1707552354>]
```

This 0-length file is created by S3 when the folder is created, and can be seen by calling:

```
s3_fs.find(selector.base_dir, maxdepth=None, withdirs=True, detail=True)
>>> {'my_bucket/my_dataset_path': {'Key': 'my_bucket/my_dataset_path',
     'LastModified': datetime.datetime(2022, 11, 3, 15, 3, 4, tzinfo=tzutc()),
     'ETag': '"d41d8cd92342344e9800998ecf8427e"',
     'Size': 0,
     'StorageClass': 'STANDARD',
     'type': 'file',
     'size': 0,
     'name': 'my_bucket/my_dataset_path'},
    'my_bucket/my_dataset_path/part-0.parquet': {'Key': 'my_bucket/my_dataset_path/part-0.parquet',
     'LastModified': datetime.datetime(2023, 8, 1, 9, 8, 21, tzinfo=tzutc()),
     'ETag': '"ce6df3f4234f5e078431949620e203c8-33"',
     'Size': 1707552354,
     'StorageClass': 'STANDARD',
     'type': 'file',
     'size': 1707552354,
     'name': 'my_bucket/my_dataset_path/part-0.parquet'}}
```

I assume, though I cannot find where, that the pyarrow S3 filesystem code ignores these 0-length objects. The filtering behaviour should be the same for fsspec filesystems.

### Component(s)

Python
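In the meantime, a possible workaround is to filter the zero-length directory-marker entries out of the fsspec listing before pyarrow sees them. Below is a minimal, untested sketch (the subclass name and the filtering rule are mine, not part of pyarrow) that overrides `FSSpecHandler.get_file_info_selector` to drop zero-byte "file" entries that are really the marker object for a listed directory:

```
from pyarrow.fs import FSSpecHandler, FileType, PyFileSystem

class MarkerFilteringFSSpecHandler(FSSpecHandler):
    # Hypothetical subclass: drop zero-length "directory marker" objects
    # from selector listings, mimicking the filtering the native pyarrow
    # S3FileSystem appears to apply.
    def get_file_info_selector(self, selector):
        infos = super().get_file_info_selector(selector)
        base = selector.base_dir.rstrip("/")
        return [
            info
            for info in infos
            # Keep everything except zero-byte "files" whose path is a
            # trailing-slash marker key or the selector's base directory.
            if not (
                info.type == FileType.File
                and info.size == 0
                and (info.path.endswith("/") or info.path.rstrip("/") == base)
            )
        ]

# Usage with the s3_fs instance from the reproduction above:
pa_s3_fs_filtered = PyFileSystem(MarkerFilteringFSSpecHandler(s3_fs))
dataset = ds.dataset(
    "my_bucket/my_dataset_path",
    filesystem=pa_s3_fs_filtered,
    partitioning=None,
    format="parquet",
)
```

This only papers over the symptom on the dataset-discovery path; the proper fix is for the wrapped fsspec filesystem and the native S3 filesystem to return consistent `get_file_info` results.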
