richardstartin opened a new pull request #7802: URL: https://github.com/apache/pinot/pull/7802
I have been working with a user to understand why distinctcounts of string values with group-bys are slow, and there appear to be two issues: 1. It's necessary to dedictionarize segment level bitmaps of dictionary ids before producing intermediate results. This manifests itself in the set construction itself (line 1), reading the string values (line 3, which is mostly solved by #7708 in 0.9.0), and hashing the newly created strings to put in to the set (lines 4 and 6). 2. Appending dictionary ids to the segment level bitmaps, which results in binary searches (line 2) <img width="752" alt="Screenshot 2021-11-19 at 16 31 01" src="https://user-images.githubusercontent.com/16439049/142657826-a7dc3e29-0eb4-4e7e-9f75-2a9cab6edf1f.png"> When sorting by the distinctcounted field, surprisingly, the contribution of binary searches gets worse: <img width="759" alt="Screenshot 2021-11-19 at 16 33 59" src="https://user-images.githubusercontent.com/16439049/142658237-3b1a5a41-299e-488b-9b7e-feba08bfdec1.png"> This change accumulates dictionary ids in a small bitset (8KB) which is flushed to the dictionary id bitmap whenever a dictionary id in a different 16 bit interval is encountered, to avoid doing binary searches. Whenever the cardinality of the counted column is less than 2^16, this is enough capacity to accumulate the entire distinct count, and when the dictionary ids within a column _roughly_ increase with docId (+/- 2^15) the small bitset bypasses binary searches. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
