Dear all,

Dictionary encoding is an important feature, so its implementation should
perform well.
The current Java dictionary encoder is implemented as static utility
methods in org.apache.arrow.vector.dictionary.DictionaryEncoder. This
design incurs heavy performance overhead, which limits its usefulness in
practice:

1. The hash table cannot be reused for encoding multiple vectors (other
data structures and results cannot be reused either).
2. The output vector is created and managed by the encoder; it should
instead be supplied by the caller (just as in the out-of-place sorter).
3. Different scenarios require different hash code algorithms to avoid
collisions in the hash table, but this is not supported.

Although some of these problems can be overcome by refactoring the current
implementation, it is difficult to do so without significantly changing the
current API.
So we propose a new design [1][2] for the dictionary encoder, to make it
more performant in practice.

We plan to implement the new dictionary encoders as stateful objects, so
that useful partial/intermediate results (such as the hash table) can be
reused across vectors. The new encoders will also support different hash
code algorithms for different scenarios, to achieve good performance.
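
To make the idea concrete, here is a minimal sketch in plain Java. It is
not the proposed Arrow API: the class name StatefulEncoderSketch, its
constructor, and its encode method are hypothetical, and plain String and
int arrays stand in for Arrow vectors. It only illustrates the three
points above: the hash table is built once and reused, the caller owns the
output, and the hash function is pluggable. Null handling is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical sketch only; none of these names exist in Arrow. */
final class StatefulEncoderSketch {

    /** Wraps a value so that the caller-supplied hash function picks the bucket. */
    private static final class Key {
        final String value;
        final int hash;

        Key(String value, int hash) {
            this.value = value;
            this.hash = hash;
        }

        @Override
        public int hashCode() {
            return hash;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).value.equals(value);
        }
    }

    // State kept between calls: dictionary value -> dictionary index.
    private final Map<Key, Integer> index = new HashMap<>();
    private final ToIntFunction<String> hasher;

    /** Builds the hash table once, using the hash function chosen by the caller. */
    StatefulEncoderSketch(String[] dictionary, ToIntFunction<String> hasher) {
        this.hasher = hasher;
        for (int i = 0; i < dictionary.length; i++) {
            // putIfAbsent keeps the first index if the dictionary has duplicates.
            index.putIfAbsent(new Key(dictionary[i], hasher.applyAsInt(dictionary[i])), i);
        }
    }

    /** Encodes input into an output array that the caller creates and owns. */
    void encode(String[] input, int[] output) {
        if (output.length < input.length) {
            throw new IllegalArgumentException("output is too small");
        }
        for (int i = 0; i < input.length; i++) {
            Integer code = index.get(new Key(input[i], hasher.applyAsInt(input[i])));
            if (code == null) {
                throw new IllegalArgumentException("value not in dictionary: " + input[i]);
            }
            output[i] = code;
        }
    }
}
```

With this shape, one encoder instance can encode any number of inputs
against the same dictionary without rebuilding the hash table, and a
different hasher (for example String::hashCode, or a cheaper function when
the values are known to be short) can be chosen per scenario.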

We plan to support the new encoders in the following steps:

1. Implement the new dictionary encoders in the algorithm module [3][4].
2. Deprecate the old dictionary encoder.
3. Remove the old encoder implementations.

Your comments and feedback are welcome.

Best,
Liya Fan

[1] https://issues.apache.org/jira/browse/ARROW-5917
[2] https://issues.apache.org/jira/browse/ARROW-6184
[3] https://github.com/apache/arrow/pull/4994
[4] https://github.com/apache/arrow/pull/5058
