[
https://issues.apache.org/jira/browse/ARROW-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Garrett Weaver updated ARROW-14959:
-----------------------------------
Description:
I am trying to read a Spark-generated, Hive-style partitioned Parquet dataset
with `gcsfs` and `pyarrow`, but I get a `FileNotFoundError` whether I read from
the base directory or directly from one of the partitions. I am not sure if I
am doing something wrong or if this is simply not supported.
Note that I have successfully read this Hive-style partitioned Parquet dataset
using other methods, which rules out more general issues:
* Successful read with PySpark using `spark.read.parquet`
* Successful read of a specific partition by passing a list of paths to
`ParquetDataset` (a sketch follows the error output below)
* Successful read of another Spark-generated Parquet dataset without Hive-style
partitions from GCS
Below is what I am trying:
{code:python}
import gcsfs
import pyarrow as pa
import pyarrow.dataset  # pa.dataset is not loaded by "import pyarrow" alone

gcs = gcsfs.GCSFileSystem()

partitions = pa.dataset.partitioning(
    pa.schema([("partition_var", pa.string())]),
    dictionaries="infer",
    flavor="hive",
)

# Reading from the base directory with explicit Hive partitioning fails.
partitioned_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir",
    filesystem=gcs,
    format="parquet",
    partitioning=partitions,
)

# Reading a single partition directory directly fails as well.
partition_of_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir/partition_var=some_value",
    filesystem=gcs,
    format="parquet",
)
{code}
The errors returned for both are below:
{code}
FileNotFoundError: path/to/partitioned/dataset/base/dir/
FileNotFoundError: path/to/partitioned/dataset/base/dir/partition_var=some_value/
{code}
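For reference, here is a minimal sketch of the `ParquetDataset` workaround from
the list above. The glob pattern and paths are hypothetical placeholders carried
over from the report; the point is that passing concrete file paths, rather than
pointing at a directory, is what avoids the error:
{code:python}
import gcsfs
import pyarrow.parquet as pq

gcs = gcsfs.GCSFileSystem()

# List the parquet files inside one partition directory (hypothetical path).
paths = gcs.glob(
    "path/to/partitioned/dataset/base/dir/partition_var=some_value/*.parquet"
)

# Passing explicit file paths succeeds where pointing at the directory fails.
dataset = pq.ParquetDataset(paths, filesystem=gcs)
table = dataset.read()
{code}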
was:
I am trying to read a Spark-generated, Hive-style partitioned Parquet dataset
with `gcsfs` and `pyarrow`, but I get a `FileNotFoundError` whether I read from
the base directory or directly from one of the partitions. I am not sure if I
am doing something wrong or if this is simply not supported.
Note that I have successfully read this Hive-style partitioned Parquet dataset
using other methods, which rules out more general issues:
* Successful read with PySpark using `spark.read.parquet`
* Successful read of a specific partition by passing a list of paths to
`ParquetDataset`
* Successful read of another Spark-generated Parquet dataset without Hive-style
partitions from GCS
Below is what I am trying:
```python
import gcsfs
import pyarrow as pa
import pyarrow.dataset  # pa.dataset is not loaded by "import pyarrow" alone

gcs = gcsfs.GCSFileSystem()

partitions = pa.dataset.partitioning(
    pa.schema([("partition_var", pa.string())]),
    dictionaries="infer",
    flavor="hive",
)

# Reading from the base directory with explicit Hive partitioning fails.
partitioned_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir",
    filesystem=gcs,
    format="parquet",
    partitioning=partitions,
)

# Reading a single partition directory directly fails as well.
partition_of_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir/partition_var=some_value",
    filesystem=gcs,
    format="parquet",
)
```
The errors returned for both are below:
```
FileNotFoundError: path/to/partitioned/dataset/base/dir/
FileNotFoundError: path/to/partitioned/dataset/base/dir/partition_var=some_value/
```
> Reading Hive-style partitioned parquet files from GCS
> -----------------------------------------------------
>
> Key: ARROW-14959
> URL: https://issues.apache.org/jira/browse/ARROW-14959
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 6.0.1
> Reporter: Garrett Weaver
> Priority: Minor
>
> I am trying to read a Spark-generated, Hive-style partitioned Parquet dataset
> with `gcsfs` and `pyarrow`, but I get a `FileNotFoundError` whether I read
> from the base directory or directly from one of the partitions. I am not sure
> if I am doing something wrong or if this is simply not supported.
> Note that I have successfully read this Hive-style partitioned Parquet dataset
> using other methods, which rules out more general issues:
> * Successful read with PySpark using `spark.read.parquet`
> * Successful read of a specific partition by passing a list of paths to
> `ParquetDataset`
> * Successful read of another Spark-generated Parquet dataset without
> Hive-style partitions from GCS
> Below is what I am trying:
>
> {code:python}
> import gcsfs
> import pyarrow as pa
> import pyarrow.dataset  # pa.dataset is not loaded by "import pyarrow" alone
>
> gcs = gcsfs.GCSFileSystem()
>
> partitions = pa.dataset.partitioning(
>     pa.schema([("partition_var", pa.string())]),
>     dictionaries="infer",
>     flavor="hive",
> )
>
> # Reading from the base directory with explicit Hive partitioning fails.
> partitioned_dataset = pa.dataset.dataset(
>     "path/to/partitioned/dataset/base/dir",
>     filesystem=gcs,
>     format="parquet",
>     partitioning=partitions,
> )
>
> # Reading a single partition directory directly fails as well.
> partition_of_dataset = pa.dataset.dataset(
>     "path/to/partitioned/dataset/base/dir/partition_var=some_value",
>     filesystem=gcs,
>     format="parquet",
> )
> {code}
>
> The errors returned for both are below:
> {code}
> FileNotFoundError: path/to/partitioned/dataset/base/dir/
> FileNotFoundError: path/to/partitioned/dataset/base/dir/partition_var=some_value/
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)