GWphua commented on code in PR #18021: URL: https://github.com/apache/druid/pull/18021#discussion_r2125869364
##########
docs/development/extensions-contrib/druid-exact-cardinality.md:
##########
@@ -0,0 +1,443 @@
+---
+id: druid-exact-cardinality
+title: "Exact Cardinality"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements. See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership. The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License. You may obtain a copy of the License at
+  ~
+  ~ http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied. See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+This extension provides exact cardinality counting functionality for LONG type columns using [Roaring Bitmaps](https://roaringbitmap.org/). Unlike approximate cardinality aggregators like HyperLogLog, this aggregator provides precise distinct counts.
+
+## Installation
+
+To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `druid-exact-cardinality` in the extensions load list.
+
+## How it Works
+
+The extension uses `Roaring64NavigableMap` as its underlying data structure to efficiently store and compute exact cardinality of 64-bit integers. It provides two types of aggregators that serve different purposes:
+
+### Build Aggregator (Bitmap64ExactCardinalityBuild)
+
+The BUILD aggregator is used when you want to compute cardinality directly from raw LONG values:
+
+- Used during ingestion or when querying raw data
+- Must be used on columns of type LONG, otherwise the output will be 1.
+
+Example:
+
+```json
+{
+  "type": "Bitmap64ExactCardinalityBuild",
+  "name": "unique_values",
+  "fieldName": "id"
+}
+```
+
+### Merge Aggregator (Bitmap64ExactCardinalityMerge)
+
+The MERGE aggregator is used when working with pre-computed bitmaps:
+
+- Used for querying pre-aggregated data (columns that were previously aggregated using BUILD)
+- Combines multiple bitmaps using bitwise operations
+- Must be used on columns that are aggregated using BUILD
+- `Bitmap64ExactCardinalityMerge` aggregator is recommended for use in `timeseries` type queries, though it also works for `topN` and `groupBy` queries.
+
+Example:
+
+```json
+{
+  "type": "Bitmap64ExactCardinalityMerge",
+  "name": "total_unique_values",
+  "fieldName": "unique_values" // Must be a pre-computed bitmap
+}
+```
+
+### Typical Workflow
+
+1. During ingestion, use BUILD to create the initial bitmap:
+   ```json
+   {
+     "type": "index",
+     "spec": {
+       "dataSchema": {
+         "metricsSpec": [
+           {
+             "type": "Bitmap64ExactCardinalityBuild",
+             "name": "unique_users",
+             "fieldName": "user_id"
+           }
+         ]
+       }
+     }
+   }
+   ```
+
+2. When querying the aggregated data, use MERGE to combine bitmaps:
+   ```json
+   {
+     "queryType": "timeseries",
+     "aggregations": [
+       {
+         "type": "Bitmap64ExactCardinalityMerge",
+         "name": "total_unique_users",
+         "fieldName": "unique_users"
+       }
+     ]
+   }
+   ```
+
+## Usage
+
+### SQL Query
+
+You can use the `BITMAP64_EXACT_CARDINALITY` function in SQL queries:

Review Comment:
   Yes, as long as the column is a Long column, it can be used without needing to pre-aggregate the column.
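
To illustrate the point in the comment above, here is a minimal sketch in Python of why counting directly from a raw Long column and merging pre-aggregated bitmaps should give the same result. Plain Python sets stand in for `Roaring64NavigableMap`, and the `build` and `merge` helpers and the sample `user_id` values are hypothetical; this is not the extension's code, only a model of the BUILD/MERGE behaviour described in the doc.

```python
from typing import Iterable, Set


def build(values: Iterable[int]) -> Set[int]:
    """Stand-in for BUILD: collect raw LONG values into a 'bitmap' (a set here)."""
    return set(values)


def merge(bitmaps: Iterable[Set[int]]) -> Set[int]:
    """Stand-in for MERGE: union previously built bitmaps."""
    merged: Set[int] = set()
    for bitmap in bitmaps:
        merged |= bitmap
    return merged


# Hypothetical user_id values spread over two segments; 3 appears in both.
segment_a = [1, 2, 3, 3]
segment_b = [3, 4]

# Path 1: build straight from the raw LONG values at query time.
direct = len(build(segment_a + segment_b))

# Path 2: build per segment at ingestion, then merge at query time.
rolled_up = len(merge([build(segment_a), build(segment_b)]))

# Both paths count each distinct value exactly once.
assert direct == rolled_up
```

In this model the pre-aggregation step changes only when the bitmap is built, not what it contains, so either path yields the same exact distinct count.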
