salman1993 commented on issue #30481:
URL: https://github.com/apache/arrow/issues/30481#issuecomment-1716030767
We are also facing a similar issue. We have a Hive-style partitioned Parquet
dataset written with Spark, and we cannot load it with pyarrow (using gcsfs as
the filesystem). We get a FileNotFoundError when we run:
```
import gcsfs
import pyarrow.parquet as pq

pq_ds = pq.ParquetDataset(
    path,  # GCS path to the Hive-partitioned dataset
    filesystem=gcsfs.GCSFileSystem(),
    pre_buffer=False,
    use_legacy_dataset=False,
    partitioning="hive",
)
```
Error:
```
Traceback (most recent call last):
  File "/Users/smohammed/Development/playground/read_parquet/benchmark_pyarrow_gcs.py", line 35, in <module>
    pq_ds = pq.ParquetDataset(
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 1663, in __new__
    return _ParquetDatasetV2(
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2351, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/dataset.py", line 694, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/dataset.py", line 449, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 1857, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1190, in pyarrow._fs._cb_open_input_file
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/fs.py", line 400, in open_input_file
    raise FileNotFoundError(path)
FileNotFoundError: <redacted>/benchmark_pq_data/small/
```
We can also confirm that the files do exist and that individual files load
fine with `pq.read_table(...)`.