cyb70289 opened a new pull request #8091:
URL: https://github.com/apache/arrow/pull/8091
… value range
For int16/32/64 arrays of reasonable length, first scan the array to find
the min and max values. If (max - min) is within some threshold, using a
value-indexed array instead of a general hashmap can improve performance
significantly.
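
As a rough illustration of the idea (not the code from this patch), the sketch below scans for min/max and, when the range is small enough, counts values in a plain array indexed by (value - min). The names `DenseCounts` and `CountDense`, and the `max_range` parameter, are hypothetical; the threshold the patch actually uses is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: counts[i] holds the number of occurrences
// of the value (min + i).
struct DenseCounts {
  int32_t min = 0;
  std::vector<int64_t> counts;
};

// Tries to count values with a value-indexed array.
// Returns false (leaving *out untouched) when the input is empty or when
// (max - min) exceeds max_range, in which case a caller would fall back
// to a general hashmap. max_range is an assumed, caller-chosen threshold.
bool CountDense(const std::vector<int32_t>& values, uint64_t max_range,
                DenseCounts* out) {
  assert(out != nullptr);
  if (values.empty()) return false;

  // Pass 1: find the value range.
  const auto mm = std::minmax_element(values.begin(), values.end());
  const int64_t min = *mm.first;
  const int64_t max = *mm.second;
  // Computed in 64 bits so that (max - min) cannot overflow for int32 input.
  const uint64_t range = static_cast<uint64_t>(max - min);
  if (range > max_range) return false;

  // Pass 2: one array increment per element, with no hashing.
  std::vector<int64_t> counts(static_cast<size_t>(range) + 1, 0);
  for (int32_t v : values) {
    ++counts[static_cast<size_t>(static_cast<int64_t>(v) - min)];
  }

  out->min = static_cast<int32_t>(min);
  out->counts = std::move(counts);
  return true;
}
```

The extra min/max pass is the cost paid up front; it is wasted whenever the range turns out to be too large for the array path.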
To stay compatible with chunked arrays, the value count array is converted
to a hashmap before being merged with the counts from other chunks. This
conversion is an overhead for short arrays. Finding min/max may also
introduce a performance penalty in some cases.
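
The conversion step might look roughly like the sketch below, which folds one chunk's value-indexed counts into a shared hashmap. `CountMap` and `MergeDenseCounts` are hypothetical names, and the layout (slot i holds the count of min + i) is an assumption made for illustration, not the patch's actual data structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical merged representation: value -> total count across chunks.
using CountMap = std::unordered_map<int32_t, int64_t>;

// Folds one chunk's value-indexed counts into the shared map.
// dense[i] is assumed to hold the count of the value (min + i).
void MergeDenseCounts(int32_t min, const std::vector<int64_t>& dense,
                      CountMap* merged) {
  assert(merged != nullptr);
  for (size_t i = 0; i < dense.size(); ++i) {
    if (dense[i] == 0) continue;  // value never seen in this chunk
    const int32_t value = static_cast<int32_t>(static_cast<int64_t>(min) +
                                               static_cast<int64_t>(i));
    (*merged)[value] += dense[i];
  }
}
```

In this sketch the loop visits every slot of the array, so its cost scales with the value range rather than with the chunk length, which is one way a short array can end up slower than with a hashmap alone.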
Please note that it's hard to benefit all use cases. With this patch applied:
- about 2x performance uplift for integers in a small value range
- no obvious performance drop in normal cases
- non-trivial performance drop in some cases
  * 40% drop for short int8 arrays (8k length)
  * 10% drop for sparse arrays (few distinct values, big value gap)