[
https://issues.apache.org/jira/browse/ARROW-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Garrett Weaver updated ARROW-14959:
-----------------------------------
Description:
I am trying to read a Spark-generated, Hive-style partitioned Parquet dataset
with `gcsfs` and `pyarrow`, but I get a `FileNotFoundError` whether I read from
the base directory or directly from one of the partitions. I am not sure if I
am doing something wrong or if this is simply not supported.
Note that I have successfully read this Hive-style partitioned Parquet dataset
using other methods, which rules out more general issues:
* Successful read with PySpark using `spark.read.parquet`
* Successful read of a specific partition by passing a list of paths to
`ParquetDataset` (a sketch follows the error output below)
* Successful read of another Spark-generated Parquet dataset without Hive-style
partitions from GCS
Below is what I am trying:
{code:python}
import gcsfs
import pyarrow as pa
import pyarrow.dataset  # pa.dataset is not loaded by "import pyarrow" alone

gcs = gcsfs.GCSFileSystem()

partitions = pa.dataset.partitioning(
    pa.schema([("partition_var", pa.string())]),
    dictionaries="infer",
    flavor="hive",
)

# Reading from the base directory with explicit Hive partitioning fails.
partitioned_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir",
    filesystem=gcs,
    format="parquet",
    partitioning=partitions,
)

# Reading a single partition directory directly fails as well.
partition_of_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir/partition_var=some_value",
    filesystem=gcs,
    format="parquet",
)
{code}
The errors returned for both are below:
{code}
FileNotFoundError: path/to/partitioned/dataset/base/dir/
FileNotFoundError: path/to/partitioned/dataset/base/dir/partition_var=some_value/
{code}
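For reference, here is a minimal sketch of the `ParquetDataset` workaround from
the list above. The glob pattern and paths are hypothetical placeholders carried
over from the report; the point is that passing concrete file paths, rather than
pointing at a directory, is what avoids the error:
{code:python}
import gcsfs
import pyarrow.parquet as pq

gcs = gcsfs.GCSFileSystem()

# List the parquet files inside one partition directory (hypothetical path).
paths = gcs.glob(
    "path/to/partitioned/dataset/base/dir/partition_var=some_value/*.parquet"
)

# Passing explicit file paths succeeds where pointing at the directory fails.
dataset = pq.ParquetDataset(paths, filesystem=gcs)
table = dataset.read()
{code}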
was:
I am trying to read a Spark-generated, Hive-style partitioned Parquet dataset
with `gcsfs` and `pyarrow`, but I get a `FileNotFoundError` whether I read from
the base directory or directly from one of the partitions. I am not sure if I
am doing something wrong or if this is simply not supported.
Note that I have successfully read this Hive-style partitioned Parquet dataset
using other methods, which rules out more general issues:
* Successful read with PySpark using `spark.read.parquet`
* Successful read of a specific partition by passing a list of paths to
`ParquetDataset`
* Successful read of another Spark-generated Parquet dataset without Hive-style
partitions from GCS
Below is what I am trying:
```python
import gcsfs
import pyarrow as pa
import pyarrow.dataset  # pa.dataset is not loaded by "import pyarrow" alone

gcs = gcsfs.GCSFileSystem()

partitions = pa.dataset.partitioning(
    pa.schema([("partition_var", pa.string())]),
    dictionaries="infer",
    flavor="hive",
)

# Reading from the base directory with explicit Hive partitioning fails.
partitioned_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir",
    filesystem=gcs,
    format="parquet",
    partitioning=partitions,
)

# Reading a single partition directory directly fails as well.
partition_of_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir/partition_var=some_value",
    filesystem=gcs,
    format="parquet",
)
```
The errors returned for both are below:
```
FileNotFoundError: path/to/partitioned/dataset/base/dir/
FileNotFoundError: path/to/partitioned/dataset/base/dir/partition_var=some_value/
```
> Reading Hive-style partitioned parquet files from GCS
> -----------------------------------------------------
>
> Key: ARROW-14959
> URL: https://issues.apache.org/jira/browse/ARROW-14959
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 6.0.1
> Reporter: Garrett Weaver
> Priority: Minor
>
> I am trying to read a Spark-generated, Hive-style partitioned Parquet dataset
> with `gcsfs` and `pyarrow`, but I get a `FileNotFoundError` whether I read
> from the base directory or directly from one of the partitions. I am not sure
> if I am doing something wrong or if this is simply not supported.
> Note that I have successfully read this Hive-style partitioned Parquet dataset
> using other methods, which rules out more general issues:
> * Successful read with PySpark using `spark.read.parquet`
> * Successful read of a specific partition by passing a list of paths to
> `ParquetDataset`
> * Successful read of another Spark-generated Parquet dataset without
> Hive-style partitions from GCS
> Below is what I am trying:
>
> {code:python}
> import gcsfs
> import pyarrow as pa
> import pyarrow.dataset  # pa.dataset is not loaded by "import pyarrow" alone
>
> gcs = gcsfs.GCSFileSystem()
>
> partitions = pa.dataset.partitioning(
>     pa.schema([("partition_var", pa.string())]),
>     dictionaries="infer",
>     flavor="hive",
> )
>
> # Reading from the base directory with explicit Hive partitioning fails.
> partitioned_dataset = pa.dataset.dataset(
>     "path/to/partitioned/dataset/base/dir",
>     filesystem=gcs,
>     format="parquet",
>     partitioning=partitions,
> )
>
> # Reading a single partition directory directly fails as well.
> partition_of_dataset = pa.dataset.dataset(
>     "path/to/partitioned/dataset/base/dir/partition_var=some_value",
>     filesystem=gcs,
>     format="parquet",
> )
> {code}
>
> The errors returned for both are below:
> {code}
> FileNotFoundError: path/to/partitioned/dataset/base/dir/
> FileNotFoundError: path/to/partitioned/dataset/base/dir/partition_var=some_value/
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)