GWphua opened a new pull request, #18021:
URL: https://github.com/apache/druid/pull/18021

   ### Description
   
   This PR introduces the `druid-exact-cardinality` extension, providing a new 
aggregation function for computing the exact distinct count of values within a 
dimension. Unlike approximate cardinality estimators like HyperLogLog, this 
extension guarantees precision, which is crucial for use cases demanding exact 
figures.
   
   The patch achieves this by leveraging RoaringBitmap, a data structure 
optimized for storing and manipulating sets of 64-bit integers with good 
compression and performance. The extension includes the necessary components 
for integrating this functionality into Druid's query processing engine.
   
   #### Introduced Exact Cardinality Aggregation
   
   The core of this PR is the implementation of an exact cardinality aggregator 
using RoaringBitmap64.
   
   *   **Choice of algorithms and Data Structure**:
       RoaringBitmap64 was selected due to its superior performance 
characteristics for exact set representation, especially when dealing with 
64-bit hash values derived from diverse Druid column types. Its primary 
benefits include:
       *   **Exactness**: Guarantees 100% accuracy in cardinality counts, 
fulfilling the core requirement.
       *   **Compression**: Offers efficient serialization, leading to smaller 
intermediate data sizes during distributed query execution and potentially more 
compact segment storage if these bitmaps are pre-aggregated. This is vital for 
Druid's performance.
       *   **Fast Operations**: Provides highly optimized union operations, 
which are critical for the `MergeAggregator` when combining sub-results from 
different segments or nodes.
       The alternative, using standard `HashSet` objects, would be 
significantly less memory-efficient and slower for large cardinalities. While 
approximate algorithms like HLL are fast and compact, they don't meet the 
"exactness" criteria addressed by this extension.
   
   *   **Behavioral aspects**:
       The aggregator is invoked via the `bitmap64ExactCardinality` type in 
native queries or a corresponding SQL function. It's designed to ingest values 
from any dimension. Internally, these input values are typically hashed into 
64-bit long values before being added to the `RoaringBitmap64Counter`. This 
hashing ensures uniform handling of different data types.
       *   Configuration is minimal, primarily involving specifying the input 
dimension and the output name.
       *   Empty inputs or all-null inputs correctly result in a cardinality of 
0.
       *   Error handling for extreme cardinalities leading to potential 
out-of-memory situations relies on Druid's standard query failure mechanisms, 
with `BufferAggregator` variants being the preferred way to handle 
larger-than-heap datasets.
   
   #### Release note
   Introduced a new extension `druid-exact-cardinality` which provides an 
aggregator for computing exact distinct counts using RoaringBitmap64. This 
offers a precise alternative to approximate cardinality aggregators when exact 
results are required. The aggregator can be used in native queries and via a 
new SQL function ` BITMAP64_EXACT_CARDINALITY(columnName)`.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityAggregatorFactory`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityBuildAggregatorFactory`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityMergeAggregatorFactory`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityBuildAggregator`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityMergeAggregator`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityBuildBufferAggregator`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityMergeBufferAggregator`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.RoaringBitmap64Counter`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityBuildComplexMetricSerde`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityMergeComplexMetricSerde`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityModule`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityPostAggregator`
    * 
`org.apache.druid.query.aggregation.exact.cardinality.bitmap64.sql.Bitmap64ExactCardinalitySqlAggregator`
   
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not 
all of these items apply to every PR. Remove the items which are not done or 
not relevant to the PR. None of the items from the checklist below are strictly 
necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   
   - [x] been self-reviewed.
   - [x] added documentation for new or modified features or behaviors.
   - [x] a release note entry in the PR description.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [x] been tested in a test Druid cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to