GWphua opened a new pull request, #18021:
URL: https://github.com/apache/druid/pull/18021
### Description
This PR introduces the `druid-exact-cardinality` extension, providing a new
aggregation function for computing the exact distinct count of values within a
dimension. Unlike approximate cardinality estimators like HyperLogLog, this
extension guarantees precision, which is crucial for use cases demanding exact
figures.
The patch achieves this by leveraging RoaringBitmap, a data structure
optimized for storing and manipulating sets of 64-bit integers with good
compression and performance. The extension includes the necessary components
for integrating this functionality into Druid's query processing engine.
#### Introduced Exact Cardinality Aggregation
The core of this PR is the implementation of an exact cardinality aggregator
using RoaringBitmap64.
* **Choice of algorithms and Data Structure**:
RoaringBitmap64 was selected due to its superior performance
characteristics for exact set representation, especially when dealing with
64-bit hash values derived from diverse Druid column types. Its primary
benefits include:
* **Exactness**: Guarantees 100% accuracy in cardinality counts,
fulfilling the core requirement.
* **Compression**: Offers efficient serialization, leading to smaller
intermediate data sizes during distributed query execution and potentially more
compact segment storage if these bitmaps are pre-aggregated. This is vital for
Druid's performance.
* **Fast Operations**: Provides highly optimized union operations,
which are critical for the `MergeAggregator` when combining sub-results from
different segments or nodes.
The alternative, using standard `HashSet` objects, would be
significantly less memory-efficient and slower for large cardinalities. While
approximate algorithms like HLL are fast and compact, they don't meet the
"exactness" criteria addressed by this extension.
* **Behavioral aspects**:
The aggregator is invoked via the `bitmap64ExactCardinality` type in
native queries or a corresponding SQL function. It's designed to ingest values
from any dimension. Internally, these input values are typically hashed into
64-bit long values before being added to the `RoaringBitmap64Counter`. This
hashing ensures uniform handling of different data types.
* Configuration is minimal, primarily involving specifying the input
dimension and the output name.
* Empty inputs or all-null inputs correctly result in a cardinality of
0.
* Error handling for extreme cardinalities leading to potential
out-of-memory situations relies on Druid's standard query failure mechanisms,
with `BufferAggregator` variants being the preferred way to handle
larger-than-heap datasets.
#### Release note
Introduced a new extension `druid-exact-cardinality` which provides an
aggregator for computing exact distinct counts using RoaringBitmap64. This
offers a precise alternative to approximate cardinality aggregators when exact
results are required. The aggregator can be used in native queries and via a
new SQL function ` BITMAP64_EXACT_CARDINALITY(columnName)`.
<hr>
##### Key changed/added classes in this PR
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityAggregatorFactory`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityBuildAggregatorFactory`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityMergeAggregatorFactory`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityBuildAggregator`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityMergeAggregator`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityBuildBufferAggregator`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityMergeBufferAggregator`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.RoaringBitmap64Counter`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityBuildComplexMetricSerde`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityMergeComplexMetricSerde`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityModule`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityPostAggregator`
*
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.sql.Bitmap64ExactCardinalitySqlAggregator`
<hr>
<!-- Check the items by putting "x" in the brackets for the done things. Not
all of these items apply to every PR. Remove the items which are not done or
not relevant to the PR. None of the items from the checklist below are strictly
necessary, but it would be very helpful if you at least self-review the PR. -->
This PR has:
- [x] been self-reviewed.
- [x] added documentation for new or modified features or behaviors.
- [x] a release note entry in the PR description.
- [x] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [x] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [x] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]