[jira] [Updated] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type

Joris Van den Bossche (Jira) Thu, 30 Apr 2020 06:48:30 -0700


     [ https://issues.apache.org/jira/browse/ARROW-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joris Van den Bossche updated ARROW-8647:
-----------------------------------------
    Description: 
In the Python ParquetDataset implementation, the partition fields are returned 
as dictionary type columns. 

In the new Dataset API, partition fields currently get a plain type (integer or 
string when inferred). However, you can already request dictionary type for the 
partition keys manually by specifying the partitioning schema (in the 
{{Partitioning}} passed to the dataset factory). 

Because partition keys are typically repeated values in the resulting table, 
dictionary type can be more efficient. It might therefore be good to still have 
an option in the DatasetFactory to use dictionary types for the partition 
fields.

See also https://github.com/apache/arrow/pull/6303#discussion_r400622340

  was:
In the Python ParquetDataset implementation, the partition fields are returned 
as dictionary type columns. 

In the new Dataset API, partition fields currently get a plain type (integer or 
string when inferred). However, you can already request dictionary type for the 
partition keys manually by specifying the partitioning schema (in the 
{{Partitioning}} passed to the dataset factory). 

Because partition keys are typically repeated values in the resulting table, 
dictionary type can be more efficient. It might therefore be good to still have 
an option in the DatasetFactory to use dictionary types for the partition 
fields.


> [C++][Dataset] Optionally encode partition field values as dictionary type
> --------------------------------------------------------------------------
>
>                 Key: ARROW-8647
>                 URL: https://issues.apache.org/jira/browse/ARROW-8647
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 1.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
