[
https://issues.apache.org/jira/browse/ARROW-10099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ben Kietzman resolved ARROW-10099.
----------------------------------
Resolution: Fixed
Issue resolved by pull request 8367
[https://github.com/apache/arrow/pull/8367]
> [C++][Dataset] Also allow integer partition fields to be dictionary encoded
> ---------------------------------------------------------------------------
>
> Key: ARROW-10099
> URL: https://issues.apache.org/jira/browse/ARROW-10099
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 2.0.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> In ARROW-8647, we added the option to indicate that your partition field
> columns should be dictionary encoded, but it currently only does this for
> string type, and not for integer type (with the reasoning that for integers,
> dictionary encoding does not give any memory efficiency gains).
> In dask, they have been using categorical dtypes for _all_ partition fields,
> even if they are integers. They would like to keep doing this (apart from
> memory efficiency, using categorical/dictionary type also gives information
> about all unique values of the column, without having to calculate this), so
> it would be nice to enable this use case.
> So I think we could either simply always dictionary encode integers as well
> when {{max_partition_dictionary_size}} indicates partition fields should be
> dictionary encoded, or have an additional option to indicate that integer
> partition fields should also be encoded (if the other option indicates
> dictionary encoding should be used).
> Based on feedback from the dask PR using the dataset API at
> https://github.com/dask/dask/pull/6534#issuecomment-698723009
> cc [~rjzamora] [~bkietz]
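
To make the request above concrete, here is a minimal pure-Python sketch of what dictionary encoding integer partition fields would look like for hive-style paths. It does not use the Arrow API: the function name `infer_partition_fields`, its `dictionary_encode` flag, the type strings, and the first-occurrence ordering of dictionary values are all assumptions made for illustration, not the behavior of the C++ implementation.

```python
def infer_partition_fields(paths, dictionary_encode=True):
    """Hypothetical helper: infer partition fields from hive-style paths.

    With dictionary_encode=True, every field (integer or string) is
    returned as a dictionary of unique values plus one index per path.
    """
    # Collect the raw string value of each "key=value" segment per field.
    raw = {}
    for path in paths:
        for segment in path.split("/"):
            if "=" not in segment:
                continue  # not a partition segment, e.g. the file name
            key, value = segment.split("=", 1)
            raw.setdefault(key, []).append(value)

    fields = {}
    for key, values in raw.items():
        # Treat the field as integer only if every value parses as one.
        try:
            typed = [int(v) for v in values]
            value_type = "int32"
        except ValueError:
            typed = list(values)
            value_type = "string"

        if dictionary_encode:
            # Unique values in order of first occurrence (an assumption).
            dictionary = list(dict.fromkeys(typed))
            position = {v: i for i, v in enumerate(dictionary)}
            fields[key] = {
                "type": "dictionary<" + value_type + ">",
                "dictionary": dictionary,
                "indices": [position[v] for v in typed],
            }
        else:
            fields[key] = {"type": value_type, "values": typed}
    return fields


paths = [
    "year=2019/month=1/a.parquet",
    "year=2019/month=2/b.parquet",
    "year=2020/month=1/c.parquet",
]
fields = infer_partition_fields(paths)
print(fields["year"]["dictionary"])  # [2019, 2020]
print(fields["year"]["indices"])     # [0, 0, 1]
```

The point for a consumer such as dask is that the full set of unique values for an integer field (`[2019, 2020]` here) is available from the dictionary itself, without scanning the column.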