benhc opened a new issue, #36983:
URL: https://github.com/apache/arrow/issues/36983

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I need to use s3fs as the filesystem in the `dataset` constructor because of the performance considerations raised in https://github.com/apache/arrow/issues/33169. However, when I try to do so, I get the following stack trace:
   
   ```
   ---------------------------------------------------------------------------
   FileNotFoundError                         Traceback (most recent call last)
   Cell In[14], line 1
   ----> 1 ds.dataset(
         2     "my_bucket/my_dataset_path",
         3     filesystem=fs,
         4     partitioning=None,
         5     format="parquet"
         6 )

   File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/dataset.py:763, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
       752 kwargs = dict(
       753     schema=schema,
       754     filesystem=filesystem,
      (...)
       759     selector_ignore_prefixes=ignore_prefixes
       760 )
       762 if _is_path_like(source):
   --> 763     return _filesystem_dataset(source, **kwargs)
       764 elif isinstance(source, (tuple, list)):
       765     if all(_is_path_like(elem) for elem in source):

   File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/dataset.py:456, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
       448 options = FileSystemFactoryOptions(
       449     partitioning=partitioning,
       450     partition_base_dir=partition_base_dir,
       451     exclude_invalid_files=exclude_invalid_files,
       452     selector_ignore_prefixes=selector_ignore_prefixes
       453 )
       454 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
   --> 456 return factory.finish(schema)

   File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2752, in pyarrow._dataset.DatasetFactory.finish()

   File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

   File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()

   File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/fs.py:424, in FSSpecHandler.open_input_file(self, path)
       421 from pyarrow import PythonFile
       423 if not self.fs.isfile(path):
   --> 424     raise FileNotFoundError(path)
       426 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

   FileNotFoundError: my_bucket/my_dataset_path/
   
   ```
   
   This is caused by an inconsistency in the results of `get_file_info` between the s3fs filesystem wrapped in a `PyFileSystem` and the native pyarrow `S3FileSystem`.
   
   Code to demonstrate:
   ```
   from boto3 import Session
   import s3fs
   import pyarrow.dataset as ds
   from pyarrow.fs import PyFileSystem, FSSpecHandler, FileSelector, S3FileSystem

   session = Session()
   credentials = session.get_credentials()

   # fsspec's S3 filesystem, wrapped so pyarrow can use it
   s3_fs = s3fs.S3FileSystem(
       anon=False,
       key=credentials.access_key,
       secret=credentials.secret_key,
       token=credentials.token,
   )
   pa_s3_fs = PyFileSystem(FSSpecHandler(s3_fs))

   # pyarrow's native S3 filesystem, for comparison
   pa_fs = S3FileSystem(
       access_key=credentials.access_key,
       secret_key=credentials.secret_key,
       session_token=credentials.token,
   )

   selector = FileSelector("my_bucket/my_dataset_path/", recursive=True)
   ```
   
   ```
   pa_s3_fs.get_file_info(selector)
   >>> [<FileInfo for 'my_bucket/my_dataset_path/': type=FileType.File, size=0>,
    <FileInfo for 'my_bucket/my_dataset_path/part-0.parquet': type=FileType.File, size=1707552354>]
   ```
   
   ```
   pa_fs.get_file_info(selector)
   >>> [<FileInfo for 'my_bucket/my_dataset_path/part-0.parquet': type=FileType.File, size=1707552354>]
   ```
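   
   For what it's worth, dropping the zero-length entry whose path is the selector's base directory from the fsspec result reproduces the native output. A minimal sketch (the filter condition is my guess at the rule, not pyarrow's actual logic):
   ```
   from pyarrow.fs import FileType

   infos = pa_s3_fs.get_file_info(selector)
   # Drop the zero-length "directory marker" that is reported as a regular file.
   filtered = [
       info for info in infos
       if not (info.type == FileType.File
               and info.size == 0
               and info.path.rstrip("/") == selector.base_dir.rstrip("/"))
   ]
   # `filtered` now matches pa_fs.get_file_info(selector)
   ```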
   
   This zero-length file is created by S3 when the folder is created, and can be seen by calling:
   ```
   s3_fs.find(selector.base_dir, maxdepth=None, withdirs=True, detail=True)
   >>> {'my_bucket/my_dataset_path': {'Key': 'my_bucket/my_dataset_path',
     'LastModified': datetime.datetime(2022, 11, 3, 15, 3, 4, tzinfo=tzutc()),
     'ETag': '"d41d8cd92342344e9800998ecf8427e"',
     'Size': 0,
     'StorageClass': 'STANDARD',
     'type': 'file',
     'size': 0,
     'name': 'my_bucket/my_dataset_path'},
    'my_bucket/my_dataset_path/part-0.parquet': {'Key': 'my_bucket/my_dataset_path/part-0.parquet',
     'LastModified': datetime.datetime(2023, 8, 1, 9, 8, 21, tzinfo=tzutc()),
     'ETag': '"ce6df3f4234f5e078431949620e203c8-33"',
     'Size': 1707552354,
     'StorageClass': 'STANDARD',
     'type': 'file',
     'size': 1707552354,
     'name': 'my_bucket/my_dataset_path/part-0.parquet'}}
   ```
   
   I assume that somewhere in the pyarrow S3 filesystem code these zero-length "directory marker" files are ignored, although I cannot find where. The filtering behaviour should be the same for fsspec filesystems.
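   
   In the meantime, a workaround that avoids the directory listing entirely is to enumerate the parquet files with s3fs and pass the explicit list of paths to `ds.dataset`. A rough sketch (untested; the `size > 0` filter is my assumption about which entries to drop):
   ```
   # List the objects directly with s3fs, skip the zero-length directory
   # marker, and hand ds.dataset() explicit file paths instead of a directory.
   listing = s3_fs.find("my_bucket/my_dataset_path", withdirs=False, detail=True)
   paths = [key for key, info in listing.items() if info["size"] > 0]

   dataset = ds.dataset(paths, filesystem=pa_s3_fs, format="parquet")
   ```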
   
   ### Component(s)
   
   Python

