richardstartin commented on pull request #8189: URL: https://github.com/apache/pinot/pull/8189#issuecomment-1036521820
> @richardstartin Good suggestion on storing values in a bitmap for better performance and lower memory footprint. Is my understanding correct that in the worst case, for 32 bit values, we will use up to 16 bit per value storing them in a bitmap (not including metadata)? For 64 bit values, does long-bitmap give better performance for sparse values?
>
> Before hitting the threshold, we do want to keep the 100% accurate result because we want to use this function as a replacement of the current `DISTINCT_COUNT` in certain environments (configurable)

The worst case depends on the size of the set. The absolute worst case is more than 32 bits per value; this would happen if you had 2^16 values with a gap of roughly 2^16 between each value in the set, so that every value lands in its own container and pays the full per-container overhead. The worst case for a set of more than 2^16 values decreases monotonically, because that overhead is then shared by several values.

If we have to maintain absolute accuracy below the threshold, we can't truncate `double` to `float`, but hopefully users don't want to distinct count `double`s anyway; it's a meaningless operation given the nature of floating point numbers.
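To make the worst-case arithmetic concrete, here is a rough back-of-the-envelope sketch. It does not use the RoaringBitmap library; it is a simplified size model that assumes each 2^16-wide container costs 8 bytes of header (key, cardinality and offset), 2 bytes per value up to 4096 values, and a fixed 8192-byte bitmap above that. The class and method names (`RoaringSizeSketch`, `spreadAcrossContainers`, and so on) are hypothetical, and real in-memory overhead will differ from this model.

```java
import java.util.TreeMap;

// Simplified size model only; this is NOT the RoaringBitmap implementation.
final class RoaringSizeSketch {
    static final int ARRAY_MAX = 4096;     // assumed array-to-bitmap container threshold
    static final int BITMAP_BYTES = 8192;  // 2^16 bits stored as a plain bitmap
    static final int HEADER_BYTES = 8;     // assumed per-container header cost

    // Estimates the size in bytes of a set of distinct 32-bit values.
    static long estimateBytes(int[] distinctValues) {
        // Group values by their high 16 bits, i.e. by container key.
        TreeMap<Integer, Integer> cardinalityByKey = new TreeMap<>();
        for (int v : distinctValues) {
            cardinalityByKey.merge(v >>> 16, 1, Integer::sum);
        }
        long bytes = 0;
        for (int cardinality : cardinalityByKey.values()) {
            long payload = cardinality <= ARRAY_MAX ? 2L * cardinality : BITMAP_BYTES;
            bytes += HEADER_BYTES + payload;
        }
        return bytes;
    }

    static double bitsPerValue(int[] distinctValues) {
        return 8.0 * estimateBytes(distinctValues) / distinctValues.length;
    }

    // Adversarial layout: value i goes into container (i mod 2^16),
    // so values are spread over as many containers as possible.
    static int[] spreadAcrossContainers(int count) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = ((i & 0xFFFF) << 16) | (i >>> 16);
        }
        return out;
    }

    public static void main(String[] args) {
        for (int log2 = 16; log2 <= 20; log2++) {
            int n = 1 << log2;
            System.out.println("2^" + log2 + " values: "
                + bitsPerValue(spreadAcrossContainers(n)) + " bits per value");
        }
    }
}
```

Under these assumptions, 2^16 values spaced 2^16 apart cost 80 bits each, well above 32, and the per-value cost falls as more values share each container (48 bits at 2^17 values, 20 bits at 2^20).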
