Joris Van den Bossche created ARROW-10099:
---------------------------------------------

             Summary: [C++][Dataset] Also allow integer partition fields to be 
dictionary encoded
                 Key: ARROW-10099
                 URL: https://issues.apache.org/jira/browse/ARROW-10099
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche
             Fix For: 2.0.0


In ARROW-8647, we added the option to indicate that your partition field columns 
should be dictionary encoded, but it currently only does this for string 
type, and not for integer type (with the reasoning that for integers, 
dictionary encoding does not give any memory efficiency gains). 

In dask, they have been using categorical dtypes for _all_ partition fields, 
even if they are integers. They would like to keep doing this (apart from 
memory efficiency, using categorical/dictionary type also gives information 
about all unique values of the column, without having to calculate this), so 
it would be nice to enable this use case. 
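
To illustrate why a dictionary type exposes the unique values for free, here is a minimal pure-Python sketch of dictionary encoding applied to integer partition values. It is not Arrow's implementation; the helper name {{dictionary_encode}} and the example paths are hypothetical and only serve the illustration.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) for a sequence of hashable values.

    Illustrative helper only, not part of Arrow or dask.
    """
    dictionary = []   # unique values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one index per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Hypothetical hive-style partitioned files, one partition directory each.
paths = [
    "year=2019/a.parquet",
    "year=2020/b.parquet",
    "year=2019/c.parquet",
]

# Parse the integer partition value out of each "year=<int>" directory.
years = [int(path.split("/")[0].split("=")[1]) for path in paths]

dictionary, indices = dictionary_encode(years)
# `dictionary` already holds every unique partition value, so a consumer
# does not need to scan the column to compute them.
```

With a plain integer column, the same information would require a separate pass over the data to compute the uniques.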

So I think we could either simply always dictionary encode integers as well 
when {{max_partition_dictionary_size}} indicates partition fields should be 
dictionary encoded, or add an additional option to indicate that integer 
partition fields should also be encoded (if the other option indicates 
dictionary encoding should be used).

cc [~rjzamora] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
