Hi Micah,

Thanks for your comments.
I agree that the implementation in vector module has its advantages and
merits.

Per discussion with Ji, we want to design/implement/migrate the dictionary
encoder according to the following plan:

1. The static method version of the encoder is kept in the vector module.
It will be deprecated later.

2. A object oriented encoder based on 1 should be provided, and should be
placed into the algorithm module. For now, this implementation needs to
reference the implementation in 1.

3. The current hash table based encoder [1] is also preserved, because it
also has its advantages and merits. However, 2 and 3 should use the same
interface.

4. Gradually, we merge the implementations in 2 and 3, taking advantages of
both.

5. The search based dictionary encoder [2] is left as is.

What do you think?

Best,
Liya Fan


[1] https://github.com/apache/arrow/pull/5058
[2] https://github.com/apache/arrow/pull/4994

On Tue, Aug 13, 2019 at 11:15 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Liya Fan,
> Ji Liu has an open pull request [1] that refactors the existing
> implementation to address the re-use aspect.  I think it can also be
> extended to fix the memory ownership problem you highlighted.  More work
> would need to be address to address the customizable hash customizable
> hash.  Could you two please work together to figure out how to reconcile
> the following differences:
>
> 1.  The new implementations you referenced, require access to an
> ArrowBufPointer which precludes the usage on complex types.  The existing
> implementation works with complex types.
>
> 2.  The existing implementation has a customized hash table that avoids the
> need for boxing/unboxing.  If I remember correctly I think this showed
> approximately 3-5% performance improvement in encoding.  In both cases, it
> would probably be nice to move to an off-heap solution.
>
> Also, for removing the old encoder implementation could you provide more
> details?  The current encoder is used in the Vector module in unit tests at
> least, and the new encoders are in the algorithm package.  How do you plan
> on resolving the dependencies?
>
> [1]  https://github.com/apache/arrow/pull/5055/files
>
> On Sun, Aug 11, 2019 at 1:18 AM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Dear all,
> >
> > Dictionary encoding is an important feature, so it should be implemented
> > with good performance.
> > The current Java dictionary encoder implementation is based on static
> > utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder,
> > which has heavy performance overhead, preventing it from being useful in
> > practice:
> >
> > 1. The hash table cannot be reused for encoding multiple vectors (other
> > data structure & results cannot be reused either).
> > 2. The output vector should not be created/managed by the encoder (just
> > like in the out-of-place sorter)
> > 3. Different scenarios requires different algorithms to compute the hash
> > code to avoid conflicts in the hash table, but this is not supported.
> >
> > Although some problems can be overcome by refactoring the current
> > implementation, it is difficult to do so without significantly chaning
> the
> > current API.
> > So we propse new design [1][2] of the dictionary encoder, to make it more
> > performant in practice.
> >
> > We plan to implement the new dictionary encoders with stateful objects,
> so
> > many useful partial/immediate results can be reused. The new encoders
> > support using different hash code algorithms in different scenarios to
> > achieve good performance.
> >
> > We plan to support the new encoders in the following steps:
> >
> > 1. implement the new dictionary encoders in the algorithm module [3][4]
> > 2. make the old dictionary encoder deprecated
> > 3. remove the old encoder implementations
> >
> > Please give your valuable comments.
> >
> > Best,
> > Liya Fan
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-5917
> > [2] https://issues.apache.org/jira/browse/ARROW-6184
> > [3] https://github.com/apache/arrow/pull/4994
> > [4] https://github.com/apache/arrow/pull/5058
> >
>

Reply via email to