Hi Micah, Thanks for your comments. I agree that the implementation in vector module has its advantages and merits.
Per discussion with Ji, we want to design/implement/migrate the dictionary encoder according to the following plan: 1. The static method version of the encoder is kept in the vector module. It will be deprecated later. 2. A object oriented encoder based on 1 should be provided, and should be placed into the algorithm module. For now, this implementation needs to reference the implementation in 1. 3. The current hash table based encoder [1] is also preserved, because it also has its advantages and merits. However, 2 and 3 should use the same interface. 4. Gradually, we merge the implementations in 2 and 3, taking advantages of both. 5. The search based dictionary encoder [2] is left as is. What do you think? Best, Liya Fan [1] https://github.com/apache/arrow/pull/5058 [2] https://github.com/apache/arrow/pull/4994 On Tue, Aug 13, 2019 at 11:15 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > Hi Liya Fan, > Ji Liu has an open pull request [1] that refactors the existing > implementation to address the re-use aspect. I think it can also be > extended to fix the memory ownership problem you highlighted. More work > would need to be address to address the customizable hash customizable > hash. Could you two please work together to figure out how to reconcile > the following differences: > > 1. The new implementations you referenced, require access to an > ArrowBufPointer which precludes the usage on complex types. The existing > implementation works with complex types. > > 2. The existing implementation has a customized hash table that avoids the > need for boxing/unboxing. If I remember correctly I think this showed > approximately 3-5% performance improvement in encoding. In both cases, it > would probably be nice to move to an off-heap solution. > > Also, for removing the old encoder implementation could you provide more > details? The current encoder is used in the Vector module in unit tests at > least, and the new encoders are in the algorithm package. How do you plan > on resolving the dependencies? > > [1] https://github.com/apache/arrow/pull/5055/files > > On Sun, Aug 11, 2019 at 1:18 AM Fan Liya <liya.fa...@gmail.com> wrote: > > > Dear all, > > > > Dictionary encoding is an important feature, so it should be implemented > > with good performance. > > The current Java dictionary encoder implementation is based on static > > utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder, > > which has heavy performance overhead, preventing it from being useful in > > practice: > > > > 1. The hash table cannot be reused for encoding multiple vectors (other > > data structure & results cannot be reused either). > > 2. The output vector should not be created/managed by the encoder (just > > like in the out-of-place sorter) > > 3. Different scenarios requires different algorithms to compute the hash > > code to avoid conflicts in the hash table, but this is not supported. > > > > Although some problems can be overcome by refactoring the current > > implementation, it is difficult to do so without significantly chaning > the > > current API. > > So we propse new design [1][2] of the dictionary encoder, to make it more > > performant in practice. > > > > We plan to implement the new dictionary encoders with stateful objects, > so > > many useful partial/immediate results can be reused. The new encoders > > support using different hash code algorithms in different scenarios to > > achieve good performance. > > > > We plan to support the new encoders in the following steps: > > > > 1. implement the new dictionary encoders in the algorithm module [3][4] > > 2. make the old dictionary encoder deprecated > > 3. remove the old encoder implementations > > > > Please give your valuable comments. > > > > Best, > > Liya Fan > > > > [1] https://issues.apache.org/jira/browse/ARROW-5917 > > [2] https://issues.apache.org/jira/browse/ARROW-6184 > > [3] https://github.com/apache/arrow/pull/4994 > > [4] https://github.com/apache/arrow/pull/5058 > > >