Re: [Discussion][Java] Redesign the dictionary encoder

2019-08-13 Thread Fan Liya
Hi Micah,

Thanks for your comments.
I agree that the implementation in vector module has its advantages and
merits.

Per discussion with Ji, we want to design/implement/migrate the dictionary
encoder according to the following plan:

1. The static method version of the encoder is kept in the vector module.
It will be deprecated later.

2. A object oriented encoder based on 1 should be provided, and should be
placed into the algorithm module. For now, this implementation needs to
reference the implementation in 1.

3. The current hash table based encoder [1] is also preserved, because it
also has its advantages and merits. However, 2 and 3 should use the same
interface.

4. Gradually, we merge the implementations in 2 and 3, taking advantages of
both.

5. The search based dictionary encoder [2] is left as is.

What do you think?

Best,
Liya Fan


[1] https://github.com/apache/arrow/pull/5058
[2] https://github.com/apache/arrow/pull/4994

On Tue, Aug 13, 2019 at 11:15 AM Micah Kornfield 
wrote:

> Hi Liya Fan,
> Ji Liu has an open pull request [1] that refactors the existing
> implementation to address the re-use aspect.  I think it can also be
> extended to fix the memory ownership problem you highlighted.  More work
> would need to be address to address the customizable hash customizable
> hash.  Could you two please work together to figure out how to reconcile
> the following differences:
>
> 1.  The new implementations you referenced, require access to an
> ArrowBufPointer which precludes the usage on complex types.  The existing
> implementation works with complex types.
>
> 2.  The existing implementation has a customized hash table that avoids the
> need for boxing/unboxing.  If I remember correctly I think this showed
> approximately 3-5% performance improvement in encoding.  In both cases, it
> would probably be nice to move to an off-heap solution.
>
> Also, for removing the old encoder implementation could you provide more
> details?  The current encoder is used in the Vector module in unit tests at
> least, and the new encoders are in the algorithm package.  How do you plan
> on resolving the dependencies?
>
> [1]  https://github.com/apache/arrow/pull/5055/files
>
> On Sun, Aug 11, 2019 at 1:18 AM Fan Liya  wrote:
>
> > Dear all,
> >
> > Dictionary encoding is an important feature, so it should be implemented
> > with good performance.
> > The current Java dictionary encoder implementation is based on static
> > utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder,
> > which has heavy performance overhead, preventing it from being useful in
> > practice:
> >
> > 1. The hash table cannot be reused for encoding multiple vectors (other
> > data structure & results cannot be reused either).
> > 2. The output vector should not be created/managed by the encoder (just
> > like in the out-of-place sorter)
> > 3. Different scenarios requires different algorithms to compute the hash
> > code to avoid conflicts in the hash table, but this is not supported.
> >
> > Although some problems can be overcome by refactoring the current
> > implementation, it is difficult to do so without significantly chaning
> the
> > current API.
> > So we propse new design [1][2] of the dictionary encoder, to make it more
> > performant in practice.
> >
> > We plan to implement the new dictionary encoders with stateful objects,
> so
> > many useful partial/immediate results can be reused. The new encoders
> > support using different hash code algorithms in different scenarios to
> > achieve good performance.
> >
> > We plan to support the new encoders in the following steps:
> >
> > 1. implement the new dictionary encoders in the algorithm module [3][4]
> > 2. make the old dictionary encoder deprecated
> > 3. remove the old encoder implementations
> >
> > Please give your valuable comments.
> >
> > Best,
> > Liya Fan
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-5917
> > [2] https://issues.apache.org/jira/browse/ARROW-6184
> > [3] https://github.com/apache/arrow/pull/4994
> > [4] https://github.com/apache/arrow/pull/5058
> >
>


Re: [Discussion][Java] Redesign the dictionary encoder

2019-08-12 Thread Micah Kornfield
Hi Liya Fan,
Ji Liu has an open pull request [1] that refactors the existing
implementation to address the re-use aspect.  I think it can also be
extended to fix the memory ownership problem you highlighted.  More work
would need to be address to address the customizable hash customizable
hash.  Could you two please work together to figure out how to reconcile
the following differences:

1.  The new implementations you referenced, require access to an
ArrowBufPointer which precludes the usage on complex types.  The existing
implementation works with complex types.

2.  The existing implementation has a customized hash table that avoids the
need for boxing/unboxing.  If I remember correctly I think this showed
approximately 3-5% performance improvement in encoding.  In both cases, it
would probably be nice to move to an off-heap solution.

Also, for removing the old encoder implementation could you provide more
details?  The current encoder is used in the Vector module in unit tests at
least, and the new encoders are in the algorithm package.  How do you plan
on resolving the dependencies?

[1]  https://github.com/apache/arrow/pull/5055/files

On Sun, Aug 11, 2019 at 1:18 AM Fan Liya  wrote:

> Dear all,
>
> Dictionary encoding is an important feature, so it should be implemented
> with good performance.
> The current Java dictionary encoder implementation is based on static
> utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder,
> which has heavy performance overhead, preventing it from being useful in
> practice:
>
> 1. The hash table cannot be reused for encoding multiple vectors (other
> data structure & results cannot be reused either).
> 2. The output vector should not be created/managed by the encoder (just
> like in the out-of-place sorter)
> 3. Different scenarios requires different algorithms to compute the hash
> code to avoid conflicts in the hash table, but this is not supported.
>
> Although some problems can be overcome by refactoring the current
> implementation, it is difficult to do so without significantly chaning the
> current API.
> So we propse new design [1][2] of the dictionary encoder, to make it more
> performant in practice.
>
> We plan to implement the new dictionary encoders with stateful objects, so
> many useful partial/immediate results can be reused. The new encoders
> support using different hash code algorithms in different scenarios to
> achieve good performance.
>
> We plan to support the new encoders in the following steps:
>
> 1. implement the new dictionary encoders in the algorithm module [3][4]
> 2. make the old dictionary encoder deprecated
> 3. remove the old encoder implementations
>
> Please give your valuable comments.
>
> Best,
> Liya Fan
>
> [1] https://issues.apache.org/jira/browse/ARROW-5917
> [2] https://issues.apache.org/jira/browse/ARROW-6184
> [3] https://github.com/apache/arrow/pull/4994
> [4] https://github.com/apache/arrow/pull/5058
>


[Discussion][Java] Redesign the dictionary encoder

2019-08-11 Thread Fan Liya
Dear all,

Dictionary encoding is an important feature, so it should be implemented
with good performance.
The current Java dictionary encoder implementation is based on static
utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder,
which has heavy performance overhead, preventing it from being useful in
practice:

1. The hash table cannot be reused for encoding multiple vectors (other
data structure & results cannot be reused either).
2. The output vector should not be created/managed by the encoder (just
like in the out-of-place sorter)
3. Different scenarios requires different algorithms to compute the hash
code to avoid conflicts in the hash table, but this is not supported.

Although some problems can be overcome by refactoring the current
implementation, it is difficult to do so without significantly chaning the
current API.
So we propse new design [1][2] of the dictionary encoder, to make it more
performant in practice.

We plan to implement the new dictionary encoders with stateful objects, so
many useful partial/immediate results can be reused. The new encoders
support using different hash code algorithms in different scenarios to
achieve good performance.

We plan to support the new encoders in the following steps:

1. implement the new dictionary encoders in the algorithm module [3][4]
2. make the old dictionary encoder deprecated
3. remove the old encoder implementations

Please give your valuable comments.

Best,
Liya Fan

[1] https://issues.apache.org/jira/browse/ARROW-5917
[2] https://issues.apache.org/jira/browse/ARROW-6184
[3] https://github.com/apache/arrow/pull/4994
[4] https://github.com/apache/arrow/pull/5058