Re: [Discussion][Java] Redesign the dictionary encoder
Hi Micah, Thanks for your comments. I agree that the implementation in vector module has its advantages and merits. Per discussion with Ji, we want to design/implement/migrate the dictionary encoder according to the following plan: 1. The static method version of the encoder is kept in the vector module. It will be deprecated later. 2. A object oriented encoder based on 1 should be provided, and should be placed into the algorithm module. For now, this implementation needs to reference the implementation in 1. 3. The current hash table based encoder [1] is also preserved, because it also has its advantages and merits. However, 2 and 3 should use the same interface. 4. Gradually, we merge the implementations in 2 and 3, taking advantages of both. 5. The search based dictionary encoder [2] is left as is. What do you think? Best, Liya Fan [1] https://github.com/apache/arrow/pull/5058 [2] https://github.com/apache/arrow/pull/4994 On Tue, Aug 13, 2019 at 11:15 AM Micah Kornfield wrote: > Hi Liya Fan, > Ji Liu has an open pull request [1] that refactors the existing > implementation to address the re-use aspect. I think it can also be > extended to fix the memory ownership problem you highlighted. More work > would need to be address to address the customizable hash customizable > hash. Could you two please work together to figure out how to reconcile > the following differences: > > 1. The new implementations you referenced, require access to an > ArrowBufPointer which precludes the usage on complex types. The existing > implementation works with complex types. > > 2. The existing implementation has a customized hash table that avoids the > need for boxing/unboxing. If I remember correctly I think this showed > approximately 3-5% performance improvement in encoding. In both cases, it > would probably be nice to move to an off-heap solution. > > Also, for removing the old encoder implementation could you provide more > details? The current encoder is used in the Vector module in unit tests at > least, and the new encoders are in the algorithm package. How do you plan > on resolving the dependencies? > > [1] https://github.com/apache/arrow/pull/5055/files > > On Sun, Aug 11, 2019 at 1:18 AM Fan Liya wrote: > > > Dear all, > > > > Dictionary encoding is an important feature, so it should be implemented > > with good performance. > > The current Java dictionary encoder implementation is based on static > > utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder, > > which has heavy performance overhead, preventing it from being useful in > > practice: > > > > 1. The hash table cannot be reused for encoding multiple vectors (other > > data structure & results cannot be reused either). > > 2. The output vector should not be created/managed by the encoder (just > > like in the out-of-place sorter) > > 3. Different scenarios requires different algorithms to compute the hash > > code to avoid conflicts in the hash table, but this is not supported. > > > > Although some problems can be overcome by refactoring the current > > implementation, it is difficult to do so without significantly chaning > the > > current API. > > So we propse new design [1][2] of the dictionary encoder, to make it more > > performant in practice. > > > > We plan to implement the new dictionary encoders with stateful objects, > so > > many useful partial/immediate results can be reused. The new encoders > > support using different hash code algorithms in different scenarios to > > achieve good performance. > > > > We plan to support the new encoders in the following steps: > > > > 1. implement the new dictionary encoders in the algorithm module [3][4] > > 2. make the old dictionary encoder deprecated > > 3. remove the old encoder implementations > > > > Please give your valuable comments. > > > > Best, > > Liya Fan > > > > [1] https://issues.apache.org/jira/browse/ARROW-5917 > > [2] https://issues.apache.org/jira/browse/ARROW-6184 > > [3] https://github.com/apache/arrow/pull/4994 > > [4] https://github.com/apache/arrow/pull/5058 > > >
Re: [Discussion][Java] Redesign the dictionary encoder
Hi Liya Fan, Ji Liu has an open pull request [1] that refactors the existing implementation to address the re-use aspect. I think it can also be extended to fix the memory ownership problem you highlighted. More work would need to be address to address the customizable hash customizable hash. Could you two please work together to figure out how to reconcile the following differences: 1. The new implementations you referenced, require access to an ArrowBufPointer which precludes the usage on complex types. The existing implementation works with complex types. 2. The existing implementation has a customized hash table that avoids the need for boxing/unboxing. If I remember correctly I think this showed approximately 3-5% performance improvement in encoding. In both cases, it would probably be nice to move to an off-heap solution. Also, for removing the old encoder implementation could you provide more details? The current encoder is used in the Vector module in unit tests at least, and the new encoders are in the algorithm package. How do you plan on resolving the dependencies? [1] https://github.com/apache/arrow/pull/5055/files On Sun, Aug 11, 2019 at 1:18 AM Fan Liya wrote: > Dear all, > > Dictionary encoding is an important feature, so it should be implemented > with good performance. > The current Java dictionary encoder implementation is based on static > utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder, > which has heavy performance overhead, preventing it from being useful in > practice: > > 1. The hash table cannot be reused for encoding multiple vectors (other > data structure & results cannot be reused either). > 2. The output vector should not be created/managed by the encoder (just > like in the out-of-place sorter) > 3. Different scenarios requires different algorithms to compute the hash > code to avoid conflicts in the hash table, but this is not supported. > > Although some problems can be overcome by refactoring the current > implementation, it is difficult to do so without significantly chaning the > current API. > So we propse new design [1][2] of the dictionary encoder, to make it more > performant in practice. > > We plan to implement the new dictionary encoders with stateful objects, so > many useful partial/immediate results can be reused. The new encoders > support using different hash code algorithms in different scenarios to > achieve good performance. > > We plan to support the new encoders in the following steps: > > 1. implement the new dictionary encoders in the algorithm module [3][4] > 2. make the old dictionary encoder deprecated > 3. remove the old encoder implementations > > Please give your valuable comments. > > Best, > Liya Fan > > [1] https://issues.apache.org/jira/browse/ARROW-5917 > [2] https://issues.apache.org/jira/browse/ARROW-6184 > [3] https://github.com/apache/arrow/pull/4994 > [4] https://github.com/apache/arrow/pull/5058 >
[Discussion][Java] Redesign the dictionary encoder
Dear all, Dictionary encoding is an important feature, so it should be implemented with good performance. The current Java dictionary encoder implementation is based on static utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder, which has heavy performance overhead, preventing it from being useful in practice: 1. The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either). 2. The output vector should not be created/managed by the encoder (just like in the out-of-place sorter) 3. Different scenarios requires different algorithms to compute the hash code to avoid conflicts in the hash table, but this is not supported. Although some problems can be overcome by refactoring the current implementation, it is difficult to do so without significantly chaning the current API. So we propse new design [1][2] of the dictionary encoder, to make it more performant in practice. We plan to implement the new dictionary encoders with stateful objects, so many useful partial/immediate results can be reused. The new encoders support using different hash code algorithms in different scenarios to achieve good performance. We plan to support the new encoders in the following steps: 1. implement the new dictionary encoders in the algorithm module [3][4] 2. make the old dictionary encoder deprecated 3. remove the old encoder implementations Please give your valuable comments. Best, Liya Fan [1] https://issues.apache.org/jira/browse/ARROW-5917 [2] https://issues.apache.org/jira/browse/ARROW-6184 [3] https://github.com/apache/arrow/pull/4994 [4] https://github.com/apache/arrow/pull/5058