Hi Liya Fan,
Ji Liu has an open pull request [1] that refactors the existing
implementation to address the re-use aspect.  I think it can also be
extended to fix the memory ownership problem you highlighted.  More work
would need to be address to address the customizable hash customizable
hash.  Could you two please work together to figure out how to reconcile
the following differences:

1.  The new implementations you referenced, require access to an
ArrowBufPointer which precludes the usage on complex types.  The existing
implementation works with complex types.

2.  The existing implementation has a customized hash table that avoids the
need for boxing/unboxing.  If I remember correctly I think this showed
approximately 3-5% performance improvement in encoding.  In both cases, it
would probably be nice to move to an off-heap solution.

Also, for removing the old encoder implementation could you provide more
details?  The current encoder is used in the Vector module in unit tests at
least, and the new encoders are in the algorithm package.  How do you plan
on resolving the dependencies?

[1]  https://github.com/apache/arrow/pull/5055/files

On Sun, Aug 11, 2019 at 1:18 AM Fan Liya <liya.fa...@gmail.com> wrote:

> Dear all,
>
> Dictionary encoding is an important feature, so it should be implemented
> with good performance.
> The current Java dictionary encoder implementation is based on static
> utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder,
> which has heavy performance overhead, preventing it from being useful in
> practice:
>
> 1. The hash table cannot be reused for encoding multiple vectors (other
> data structure & results cannot be reused either).
> 2. The output vector should not be created/managed by the encoder (just
> like in the out-of-place sorter)
> 3. Different scenarios requires different algorithms to compute the hash
> code to avoid conflicts in the hash table, but this is not supported.
>
> Although some problems can be overcome by refactoring the current
> implementation, it is difficult to do so without significantly chaning the
> current API.
> So we propse new design [1][2] of the dictionary encoder, to make it more
> performant in practice.
>
> We plan to implement the new dictionary encoders with stateful objects, so
> many useful partial/immediate results can be reused. The new encoders
> support using different hash code algorithms in different scenarios to
> achieve good performance.
>
> We plan to support the new encoders in the following steps:
>
> 1. implement the new dictionary encoders in the algorithm module [3][4]
> 2. make the old dictionary encoder deprecated
> 3. remove the old encoder implementations
>
> Please give your valuable comments.
>
> Best,
> Liya Fan
>
> [1] https://issues.apache.org/jira/browse/ARROW-5917
> [2] https://issues.apache.org/jira/browse/ARROW-6184
> [3] https://github.com/apache/arrow/pull/4994
> [4] https://github.com/apache/arrow/pull/5058
>

Reply via email to