[GitHub] [pinot] richardstartin opened a new pull request #7802: avoid binary searches in distinct count

GitBox Fri, 19 Nov 2021 08:39:36 -0800


richardstartin opened a new pull request #7802:
URL: https://github.com/apache/pinot/pull/7802



   I have been working with a user to understand why distinctcounts of string 
values with group-bys are slow, and there appear to be two issues:
   1. It's necessary to dedictionarize segment level bitmaps of dictionary ids 
before producing intermediate results. This manifests itself in the set 
construction itself (line 1), reading the string values (line 3, which is 
mostly solved by #7708 in 0.9.0), and hashing the newly created strings to put 
in to the set (lines 4 and 6).
   2. Appending dictionary ids to the segment level bitmaps, which results in 
binary searches (line 2)
   <img width="752" alt="Screenshot 2021-11-19 at 16 31 01" 
src="https://user-images.githubusercontent.com/16439049/142657826-a7dc3e29-0eb4-4e7e-9f75-2a9cab6edf1f.png";>
   
   When sorting by the distinctcounted field, surprisingly, the contribution of 
binary searches gets worse:
   <img width="759" alt="Screenshot 2021-11-19 at 16 33 59" 
src="https://user-images.githubusercontent.com/16439049/142658237-3b1a5a41-299e-488b-9b7e-feba08bfdec1.png";>
   
   This change accumulates dictionary ids in a small bitset (8KB) which is 
flushed to the dictionary id bitmap whenever a dictionary id in a different 16 
bit interval is encountered, to avoid doing binary searches. Whenever the 
cardinality of the counted column is less than 2^16, this is enough capacity to 
accumulate the entire distinct count, and when the dictionary ids within a 
column _roughly_ increase with docId (+/- 2^15) the small bitset bypasses 
binary searches.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [pinot] richardstartin opened a new pull request #7802: avoid binary searches in distinct count

Reply via email to