salman1993 commented on issue #30481:
URL: https://github.com/apache/arrow/issues/30481#issuecomment-1716030767
We are also facing a similar issue. We have a Hive-style partitioned Parquet
dataset written with Spark, and we cannot load it with pyarrow (using gcsfs as
the filesystem). We get a FileNotFoundError when we run:
```
import gcsfs
import pyarrow.parquet as pq

pq_ds = pq.ParquetDataset(
    path,  # GCS path to the Hive-partitioned dataset
    filesystem=gcsfs.GCSFileSystem(),
    pre_buffer=False,
    use_legacy_dataset=False,
    partitioning="hive",
)
```
Error:
```
Traceback (most recent call last):
  File "/Users/smohammed/Development/playground/read_parquet/benchmark_pyarrow_gcs.py", line 35, in <module>
    pq_ds = pq.ParquetDataset(
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 1663, in __new__
    return _ParquetDatasetV2(
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2351, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/dataset.py", line 694, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/dataset.py", line 449, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 1857, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1190, in pyarrow._fs._cb_open_input_file
  File "/Users/smohammed/.pyenv/versions/def/lib/python3.9/site-packages/pyarrow/fs.py", line 400, in open_input_file
    raise FileNotFoundError(path)
FileNotFoundError: <redacted>/benchmark_pq_data/small/
```
We can also confirm that the files do exist and that individual files load
fine with `pq.read_table(...)`.