Dear all,

Dictionary encoding is an important feature, so it should be implemented with good performance. The current Java dictionary encoder is implemented as static utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder. This design incurs heavy performance overhead, which prevents it from being useful in practice:

1. The hash table cannot be reused for encoding multiple vectors (other data structures and results cannot be reused either).
2. The output vector should not be created/managed by the encoder; it should be supplied by the caller (just like in the out-of-place sorter).
3. Different scenarios require different algorithms to compute the hash code to avoid conflicts in the hash table, but this is not supported.

Although some of these problems can be overcome by refactoring the current implementation, it is difficult to do so without significantly changing the current API. So we propose a new design [1][2] of the dictionary encoder, to make it more performant in practice.

We plan to implement the new dictionary encoders as stateful objects, so that many useful partial/intermediate results can be reused. The new encoders support using different hash code algorithms in different scenarios to achieve good performance.

We plan to support the new encoders in the following steps:

1. Implement the new dictionary encoders in the algorithm module [3][4].
2. Deprecate the old dictionary encoder.
3. Remove the old encoder implementations.

Please give your valuable comments.

Best,
Liya Fan

[1] https://issues.apache.org/jira/browse/ARROW-5917
[2] https://issues.apache.org/jira/browse/ARROW-6184
[3] https://github.com/apache/arrow/pull/4994
[4] https://github.com/apache/arrow/pull/5058
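
P.S. To make the three points of the proposal concrete, here is a minimal sketch in plain Java. It is not the code in the pull requests [3][4] and does not use any Arrow classes: the class name SketchDictionaryEncoder, the use of arrays in place of vectors, and the ToIntFunction hasher parameter are all hypothetical, chosen only for illustration. Null values are not handled. The sketch shows a stateful encoder that builds its hash table once, writes into an output supplied by the caller, and accepts a pluggable hash function.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/**
 * Hypothetical sketch of a stateful dictionary encoder.
 * Arrays stand in for vectors; none of these names come from Arrow.
 */
final class SketchDictionaryEncoder<T> {

  /** Hash table key whose hash code comes from the caller-supplied hasher. */
  private static final class Key {
    final Object value;
    final int hash;

    Key(Object value, int hash) {
      this.value = value;
      this.hash = hash;
    }

    @Override
    public int hashCode() {
      return hash;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof Key && value.equals(((Key) o).value);
    }
  }

  // State kept across encode() calls: value -> index in the dictionary.
  private final Map<Key, Integer> index = new HashMap<>();
  private final ToIntFunction<T> hasher;

  /** Builds the hash table once, using the given hash function. */
  SketchDictionaryEncoder(T[] dictionary, ToIntFunction<T> hasher) {
    this.hasher = hasher;
    for (int i = 0; i < dictionary.length; i++) {
      // Keep the first index if the dictionary contains duplicates.
      index.putIfAbsent(new Key(dictionary[i], hasher.applyAsInt(dictionary[i])), i);
    }
  }

  /** Encodes input into an output array that the caller owns. */
  void encode(T[] input, int[] output) {
    if (output.length < input.length) {
      throw new IllegalArgumentException("output is too small");
    }
    for (int i = 0; i < input.length; i++) {
      Integer id = index.get(new Key(input[i], hasher.applyAsInt(input[i])));
      if (id == null) {
        throw new IllegalArgumentException("value not in dictionary: " + input[i]);
      }
      output[i] = id;
    }
  }
}
```

With such an object, one encoder created from a dictionary (for example with String::hashCode as the hasher) can be used to encode any number of inputs without rebuilding the hash table, and a different hasher can be passed in when the default one produces too many conflicts.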