cyb70289 opened a new pull request #8091:
URL: https://github.com/apache/arrow/pull/8091


   … value range
   
   For int16/32/64 arrays of reasonable length, scan the array first to
   find the min/max values. If (max - min) is within some threshold, using
   a value-indexed array instead of a general hashmap can improve
   performance significantly.
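
A minimal sketch of the approach described above, not the code from this patch. The names `ValueCounts`, `CountValues` and `kMaxRange`, and the threshold value itself, are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical threshold; the value used by the actual patch may differ.
constexpr uint64_t kMaxRange = 1 << 16;

struct ValueCounts {
  bool dense = false;                 // true if the value-indexed array was used
  int64_t min = 0;                    // offset of dense_counts[0]
  std::vector<int64_t> dense_counts;  // dense_counts[v - min] = occurrences of v
  std::unordered_map<int64_t, int64_t> sparse_counts;  // general fallback
};

ValueCounts CountValues(const std::vector<int64_t>& values) {
  ValueCounts out;
  if (values.empty()) return out;

  // First pass: find min and max.
  auto mm = std::minmax_element(values.begin(), values.end());
  const int64_t min = *mm.first;
  const int64_t max = *mm.second;
  // Unsigned subtraction avoids signed overflow for extreme ranges.
  const uint64_t range = static_cast<uint64_t>(max) - static_cast<uint64_t>(min);

  if (range < kMaxRange) {
    // Small range: count with direct indexing, no hashing.
    out.dense = true;
    out.min = min;
    out.dense_counts.assign(static_cast<size_t>(range) + 1, 0);
    for (int64_t v : values) {
      const size_t idx =
          static_cast<size_t>(static_cast<uint64_t>(v) - static_cast<uint64_t>(min));
      assert(idx < out.dense_counts.size());
      ++out.dense_counts[idx];
    }
  } else {
    // Large range: fall back to the general hashmap.
    for (int64_t v : values) ++out.sparse_counts[v];
  }
  return out;
}
```

The extra min/max pass is the cost paid up front; it only pays off when the second pass can use direct indexing.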
   
   To stay compatible with chunked arrays, the value count array is
   converted to a hashmap before merging with the other chunks. This adds
   overhead for short arrays. Finding min/max may also introduce a
   performance penalty in some cases.
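
The merge step for chunked arrays could look roughly like the sketch below. It assumes each chunk yields either a dense count array plus its minimum value or a hashmap; `CountMap`, `MergeDenseCounts` and `MergeMapCounts` are hypothetical names, not identifiers from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using CountMap = std::unordered_map<int64_t, int64_t>;

// Fold a dense count array into the merged hashmap.
// counts[i] holds the number of occurrences of the value (min + i).
void MergeDenseCounts(int64_t min, const std::vector<int64_t>& counts,
                      CountMap* merged) {
  for (size_t i = 0; i < counts.size(); ++i) {
    // Skip empty slots so the map only holds values that actually occur.
    if (counts[i] > 0) {
      (*merged)[min + static_cast<int64_t>(i)] += counts[i];
    }
  }
}

// Fold a chunk that was already counted with a hashmap.
void MergeMapCounts(const CountMap& chunk, CountMap* merged) {
  for (const auto& kv : chunk) {
    (*merged)[kv.first] += kv.second;
  }
}
```

The conversion walks every slot of the count array, which is where the overhead for short arrays comes from: the walk costs the same regardless of how few elements the chunk holds.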
   
   Please note it's hard to benefit all use cases. With this patch applied:
   - about 2x performance uplift for integers in a small value range
   - no obvious performance drop for normal cases
   - non-trivial performance drop in some cases
     * 40% drop for short int8 arrays (8k length)
     * 10% drop for sparse arrays (few distinct values, big value gap)



