Weston Pace created ARROW-11732:
-----------------------------------

             Summary: [C++] DictionaryEncode should convert dictionaries from 
one type of encoding to the other
                 Key: ARROW-11732
                 URL: https://issues.apache.org/jira/browse/ARROW-11732
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


There are two styles of encoding nulls in dictionaries (masked or encoded).  In 
compute::DictionaryEncode this is controlled by an option.  Today, if you pass 
a dictionary into DictionaryEncode, it is a no-op.

Instead it should check to see if the dictionary is properly encoded (this is 
easily checked in constant time) according to the requested null encoding 
scheme and, if not, it should convert it.
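
A rough sketch of what the constant-time check could look like.  The types 
below are hypothetical stand-ins, not Arrow classes, and the check assumes that 
"masked" means nulls live in the indices while "encoded" means nulls live in 
the dictionary, with both null counts already cached:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Arrow types.
enum class NullEncoding { kEncode, kMask };

struct ToyDictionaryArray {
  // A masked null is an index slot with no value.
  std::vector<std::optional<int32_t>> indices;
  // An encoded null is a dictionary entry with no value.
  std::vector<std::optional<std::string>> dictionary;
  // Cached counts, assumed to be kept up to date by whoever builds the array.
  int64_t indices_null_count = 0;
  int64_t dictionary_null_count = 0;
};

// O(1): only the cached null counts are inspected, never the data itself.
bool MatchesNullEncoding(const ToyDictionaryArray& arr, NullEncoding requested) {
  switch (requested) {
    case NullEncoding::kEncode:
      // Encoded nulls: every index is valid; any null lives in the dictionary.
      return arr.indices_null_count == 0;
    case NullEncoding::kMask:
      // Masked nulls: the dictionary holds no null; any null lives in the indices.
      return arr.dictionary_null_count == 0;
  }
  return false;
}
```

Under these assumptions an array with no nulls at all satisfies both schemes, 
so it would never need converting.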

To avoid changing existing behavior, either the default NullEncodingBehavior 
should change to EXISTING_OR_ENCODE or a second option should be added.

Once this is done, partition.cc could be improved.  It currently requires that 
dictionaries use "encoded nulls" and, if a dictionary is passed in that uses 
"masked nulls", it decodes and re-encodes the dictionary, which is a 
potentially costly operation.  This could be fixed to use the conversion.
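
To illustrate why a direct conversion should be cheaper than decoding and 
re-encoding, here is a minimal sketch of going from masked to encoded nulls.  
The struct and both function names are hypothetical, not Arrow APIs; the point 
is only that the dictionary values are never decoded or rehashed:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration only; this is not an Arrow type.
struct ToyDictionaryArray {
  std::vector<std::optional<int32_t>> indices;         // nullopt = masked null
  std::vector<std::optional<std::string>> dictionary;  // nullopt = encoded null
};

// Returns the position of the dictionary's null entry, appending one if absent.
int32_t FindOrAppendNull(std::vector<std::optional<std::string>>& dictionary) {
  for (std::size_t i = 0; i < dictionary.size(); ++i) {
    if (!dictionary[i].has_value()) return static_cast<int32_t>(i);
  }
  dictionary.push_back(std::nullopt);
  return static_cast<int32_t>(dictionary.size() - 1);
}

// Rewrites masked nulls as encoded nulls in place: one pass over the indices
// and at most one scan of the dictionary.
void MaskedToEncoded(ToyDictionaryArray& arr) {
  std::optional<int32_t> null_code;  // looked up lazily, on the first masked null
  for (auto& index : arr.indices) {
    if (index.has_value()) continue;
    if (!null_code.has_value()) {
      null_code = FindOrAppendNull(arr.dictionary);
    }
    index = null_code;
  }
}
```

For example, indices [0, null, 1, null] over the dictionary ["a", "b"] would 
become indices [0, 2, 1, 2] over the dictionary ["a", "b", null].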



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
