Joris Van den Bossche created ARROW-8647:
--------------------------------------------
Summary: [C++][Dataset] Optionally encode partition field values as dictionary type
Key: ARROW-8647
URL: https://issues.apache.org/jira/browse/ARROW-8647
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
Fix For: 1.0.0
In the Python ParquetDataset implementation, the partition fields are returned
as dictionary type columns.
In the new Dataset API, we now use a plain type (integer or string when
inferred). However, you can already manually specify that the partition keys
should be dictionary type by specifying the partitioning schema (in the
{{Partitioning}} passed to the dataset factory).
Because partition keys are typically repeated values in the resulting table,
using dictionary type can be more efficient. It might therefore be good to
still have an option in the DatasetFactory to use dictionary types for the
partition fields.
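
To illustrate why repeated partition keys benefit from dictionary encoding, here is a minimal pure-Python sketch of the idea. It does not use the Arrow API; the helper names {{dictionary_encode}} and {{dictionary_decode}} and the sample {{year}} values are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Split values into (dictionary, indices) so that
    dictionary[indices[i]] == values[i] for every position i."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}  # value -> its slot in `dictionary`
    indices: List[int] = []
    for value in values:
        if value not in positions:
            # First time we see this value: give it the next free slot.
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary: Sequence[str], indices: Sequence[int]) -> List[str]:
    """Rebuild the plain column from its dictionary-encoded form."""
    return [dictionary[i] for i in indices]


# Hypothetical partition column: 4 rows read from year=2019, 3 from year=2020.
partition_column = ["2019"] * 4 + ["2020"] * 3

dictionary, indices = dictionary_encode(partition_column)
print(dictionary)  # each distinct partition key is stored once
print(indices)     # every row stores only a small integer index
```

Each distinct key is stored once and every row carries only an integer index, so the saving grows with the number of rows per partition. An Arrow dictionary array applies the same idea in a columnar layout.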