[ https://issues.apache.org/jira/browse/ARROW-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094334#comment-17094334 ]

Joris Van den Bossche edited comment on ARROW-3388 at 4/28/20, 9:36 AM:
------------------------------------------------------------------------

Correction: it actually does work when specifying a schema (I just forgot to 
specify {{flavor="hive"}} in my example).

{code:python}
import pathlib

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# create a partitioned dataset with boolean partition keys
path = pathlib.Path(".") / "dataset_boolean_partition"
path.mkdir(exist_ok=True)
table = pa.table({"part": [True, True, False, False], "values": range(4)})
pq.write_to_dataset(table, str(path), partition_cols=["part"])

# legacy implementation -> gives a dictionary column of string values
pq.ParquetDataset(str(path)).read()

# new API -> gives a string column
pq.ParquetDataset(str(path), use_legacy_dataset=False).read()

# specify the partitioning schema explicitly -> actually get a column of bools
partitioning = ds.partitioning(pa.schema([("part", pa.bool_())]), flavor="hive")
pq.ParquetDataset(str(path), use_legacy_dataset=False, partitioning=partitioning).read()
pq.ParquetDataset(str(path), use_legacy_dataset=False, partitioning=partitioning).read().to_pandas()
{code}
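
For comparison, a minimal sketch of the same read going through the {{pyarrow.dataset}} API directly (reusing the {{path}} and {{partitioning}} objects from the snippet above):

{code:python}
# sketch only: equivalent read through pyarrow.dataset directly,
# reusing `path` and `partitioning` from the snippet above
import pyarrow.dataset as ds

dataset = ds.dataset(str(path), format="parquet", partitioning=partitioning)
dataset.to_table().to_pandas()
{code}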

Are we fine with requiring the user to pass a schema manually here? Or do we 
actually still want to automatically infer True/False partition values as booleans?
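
If we do keep strings as the inferred type, a possible user-side workaround is to cast the partition column back to boolean after reading; a minimal sketch using {{pyarrow.compute}} (the {{part}} column name and {{path}} come from the example above):

{code:python}
# sketch of a user-side workaround: convert the "True"/"False" string
# partition column back to boolean after reading
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.ParquetDataset(str(path), use_legacy_dataset=False).read()
bool_part = pc.equal(table.column("part"), "True")  # string -> bool via comparison
table = table.set_column(table.schema.get_field_index("part"), "part", bool_part)
{code}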

cc [~fsaintjacques] [~bkietz]


> [Python] boolean Partition keys in ParquetDataset are reconstructed as string
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-3388
>                 URL: https://issues.apache.org/jira/browse/ARROW-3388
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Uwe Korn
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>
> Saving a {{ParquetDataset}} using a boolean column as a partitioning column 
> will store {{True/False}} as the values in the path. On reload these columns 
> will then be string columns with the values {{'True'}} and {{'False'}}.



