Weston Pace created ARROW-11732:
-----------------------------------
Summary: [C++] DictionaryEncode should convert dictionaries from
one type of encoding to the other
Key: ARROW-11732
URL: https://issues.apache.org/jira/browse/ARROW-11732
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
There are two styles of encoding nulls in dictionaries: masked or encoded. In
compute::DictionaryEncode this is controlled by an option. Today, if you pass
a dictionary into DictionaryEncode, it is a no-op.
Instead it should check to see if the dictionary is properly encoded (this is
easily checked in constant time) according to the requested null encoding
scheme and, if not, it should convert it.
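To illustrate the check-then-convert idea, here is a minimal sketch. It does
not use Arrow's API: DictArray, NullStyle, IsEncodedAs and ConvertNulls are
hypothetical names on a simplified model of a dictionary-encoded string array.
The helpers scan for nulls for simplicity; the constant-time check described
above would presumably rely on null counts that are already available as array
metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model of a dictionary-encoded string array.
// These types are illustrative only and are not Arrow's API.
struct DictArray {
  std::vector<std::optional<std::string>> dictionary;  // nullopt entry: "encoded" null
  std::vector<std::optional<int32_t>> indices;         // nullopt index: "masked" null
};

enum class NullStyle { kMask, kEncode };

// Position of the first null dictionary entry, if any.
std::optional<int32_t> FindNullEntry(const DictArray& a) {
  for (std::size_t i = 0; i < a.dictionary.size(); ++i) {
    if (!a.dictionary[i].has_value()) return static_cast<int32_t>(i);
  }
  return std::nullopt;
}

// True if at least one index slot is null (a masked null).
bool HasMaskedNulls(const DictArray& a) {
  for (const auto& idx : a.indices) {
    if (!idx.has_value()) return true;
  }
  return false;
}

// True if the array already follows the requested null style.
// An array without any nulls satisfies both styles.
bool IsEncodedAs(const DictArray& a, NullStyle style) {
  if (style == NullStyle::kEncode) return !HasMaskedNulls(a);
  return !FindNullEntry(a).has_value();
}

// Convert between null styles without decoding and re-encoding the values.
DictArray ConvertNulls(DictArray a, NullStyle style) {
  if (IsEncodedAs(a, style)) return a;  // already in the requested style

  if (style == NullStyle::kEncode) {
    // Reuse an existing null entry or append one, then point masked slots at it.
    std::optional<int32_t> null_pos = FindNullEntry(a);
    if (!null_pos.has_value()) {
      null_pos = static_cast<int32_t>(a.dictionary.size());
      a.dictionary.push_back(std::nullopt);
    }
    for (auto& idx : a.indices) {
      if (!idx.has_value()) idx = *null_pos;
    }
    return a;
  }

  // kMask: drop null dictionary entries and remap the indices. Indices that
  // pointed at a dropped entry become null (masked).
  std::vector<std::optional<int32_t>> remap(a.dictionary.size());
  std::vector<std::optional<std::string>> kept;
  for (std::size_t i = 0; i < a.dictionary.size(); ++i) {
    if (a.dictionary[i].has_value()) {
      remap[i] = static_cast<int32_t>(kept.size());
      kept.push_back(std::move(a.dictionary[i]));
    }
  }
  a.dictionary = std::move(kept);
  for (auto& idx : a.indices) {
    if (idx.has_value()) {
      assert(*idx >= 0 && static_cast<std::size_t>(*idx) < remap.size());
      idx = remap[static_cast<std::size_t>(*idx)];
    }
  }
  return a;
}
```

In this sketch the conversion touches only the indices and the dictionary
entries themselves; the values are never decoded or re-hashed.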
The default NullEncodingBehavior should also change to EXISTING_OR_ENCODE, or a
second option should be added, so that this doesn't change existing behavior.
Once this is done, partition.cc could be improved. It currently requires that
dictionaries use "encoded nulls" and, if a dictionary is passed in that uses
"masked nulls", it decodes and re-encodes the dictionary, which is a
potentially costly operation. This could be fixed to use the conversion.