FrankChen021 commented on code in PR #18021:
URL: https://github.com/apache/druid/pull/18021#discussion_r2125418731


##########
docs/development/extensions-contrib/druid-exact-cardinality.md:
##########
@@ -0,0 +1,443 @@
+---
+id: druid-exact-cardinality
+title: "Exact Cardinality"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+This extension provides exact cardinality counting functionality for LONG type 
columns using [Roaring Bitmaps](https://roaringbitmap.org/). Unlike approximate 
cardinality aggregators like HyperLogLog, this aggregator provides precise 
distinct counts.
+

Review Comment:
   what's the difference between this extension and existing distinctCount 
extension



##########
extensions-contrib/druid-exact-cardinality/src/main/java/org/apache/druid/query/aggregation/exact/cardinality/bitmap64/Bitmap64ExactCardinalityAggregatorFactory.java:
##########
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.exact.cardinality.bitmap64;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import org.apache.druid.query.aggregation.AggregateCombiner;
+import org.apache.druid.query.aggregation.AggregatorFactory;
+import org.apache.druid.query.aggregation.ObjectAggregateCombiner;
+import org.apache.druid.query.cache.CacheKeyBuilder;
+import org.apache.druid.segment.ColumnValueSelector;
+
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+
+/**
+ * Base class for both build and merge factories
+ */
+public abstract class Bitmap64ExactCardinalityAggregatorFactory extends 
AggregatorFactory
+{
+  static final int MAX_INTERMEDIATE_SIZE = 1024; // 1 KiB

Review Comment:
   why is it 1KiB?



##########
extensions-contrib/druid-exact-cardinality/src/main/java/org/apache/druid/query/aggregation/exact/cardinality/bitmap64/RoaringBitmap64Counter.java:
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.exact.cardinality.bitmap64;
+
+import org.apache.druid.java.util.common.logger.Logger;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.nio.ByteBuffer;
+
+public class RoaringBitmap64Counter implements Bitmap64Counter
+{
+  private static final Logger log = new Logger(RoaringBitmap64Counter.class);
+
+  private final Roaring64NavigableMap bitmap;
+
+  public RoaringBitmap64Counter()
+  {
+    this.bitmap = new Roaring64NavigableMap();
+  }
+
+  private RoaringBitmap64Counter(Roaring64NavigableMap bitmap)
+  {
+    this.bitmap = bitmap;
+  }
+
+  public static RoaringBitmap64Counter fromBytes(byte[] bytes)
+  {
+    try {
+      ByteArrayInputStream byteIn = new ByteArrayInputStream(bytes);
+      DataInputStream in = new DataInputStream(byteIn);
+      Roaring64NavigableMap bitmap = new Roaring64NavigableMap();
+      bitmap.deserialize(in);
+      return new RoaringBitmap64Counter(bitmap);
+    }
+    catch (Exception e) {
+      log.error(e, "Failed to deserialize RoaringBitmap64Counter from bytes");
+      throw new RuntimeException(
+          "Failed to deserialize RoaringBitmap64Counter from bytes. Error: " + 
e.getMessage(),
+          e
+      );
+    }
+  }
+
+  @Override
+  public void add(long value)
+  {
+    bitmap.addLong(value);
+  }
+
+  @Override
+  public long getCardinality()
+  {
+    return bitmap.getLongCardinality();
+  }
+
+  @Override
+  public Bitmap64Counter fold(Bitmap64Counter rhs)
+  {
+    if (rhs != null) {
+      bitmap.or(((RoaringBitmap64Counter) rhs).bitmap);
+    }
+
+    return this;
+  }
+
+  @Override
+  public ByteBuffer toByteBuffer()

Review Comment:
   this one is used in two places, one is in Bitmap64ExactCountObjectStrategy 
and the other is in RoaringBitmap64CounterJsonSerializer, for the later one, I 
think we can apply some optimization to eliminate byte array copy



##########
docs/development/extensions-contrib/druid-exact-cardinality.md:
##########
@@ -0,0 +1,443 @@
+---
+id: druid-exact-cardinality
+title: "Exact Cardinality"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+This extension provides exact cardinality counting functionality for LONG type 
columns using [Roaring Bitmaps](https://roaringbitmap.org/). Unlike approximate 
cardinality aggregators like HyperLogLog, this aggregator provides precise 
distinct counts.
+
+## Installation
+
+To use this Apache Druid extension, 
[include](../../configuration/extensions.md#loading-extensions) 
`druid-exact-cardinality` in the extensions load list.
+
+## How it Works
+
+The extension uses `Roaring64NavigableMap` as its underlying data structure to 
efficiently store and compute exact cardinality of 64-bit integers. It provides 
two types of aggregators that serve different purposes:
+
+### Build Aggregator (Bitmap64ExactCardinalityBuild)
+
+The BUILD aggregator is used when you want to compute cardinality directly 
from raw LONG values:
+
+- Used during ingestion or when querying raw data
+- Must be used on columns of type LONG, otherwise the output will be 1.
+
+Example:
+
+```json
+{
+  "type": "Bitmap64ExactCardinalityBuild",
+  "name": "unique_values",
+  "fieldName": "id"
+}
+```
+
+### Merge Aggregator (Bitmap64ExactCardinalityMerge)
+
+The MERGE aggregator is used when working with pre-computed bitmaps:
+
+- Used for querying pre-aggregated data (columns that were previously 
aggregated using BUILD)
+- Combines multiple bitmaps using bitwise operations
+- Must be used on columns that are aggregated using BUILD
+- `Bitmap64ExactCardinalityMerge` aggregator is recommended for use in 
`timeseries` type queries, though it also works for `topN` and `groupBy` 
queries.
+
+Example:
+
+```json
+{
+  "type": "Bitmap64ExactCardinalityMerge",
+  "name": "total_unique_values",
+  "fieldName": "unique_values" // Must be a pre-computed bitmap
+}
+```
+
+### Typical Workflow
+
+1. During ingestion, use BUILD to create the initial bitmap:
+    ```json
+    {
+      "type": "index",
+      "spec": {
+        "dataSchema": {
+          "metricsSpec": [
+            {
+              "type": "Bitmap64ExactCardinalityBuild",
+              "name": "unique_users",
+              "fieldName": "user_id"
+            }
+          ]
+        }
+      }
+    }
+    ```
+
+2. When querying the aggregated data, use MERGE to combine bitmaps:
+    ```json
+    {
+      "queryType": "timeseries",
+      "aggregations": [
+        {
+          "type": "Bitmap64ExactCardinalityMerge",
+          "name": "total_unique_users",
+          "fieldName": "unique_users"
+        }
+      ]
+    }
+    ```
+
+## Usage
+
+### SQL Query
+
+You can use the `BITMAP64_EXACT_CARDINALITY` function in SQL queries:

Review Comment:
   can this function be used over a raw column instead of pre-aggregated column 
built by `Bitmap64ExactCardinalityBuild`?



##########
extensions-contrib/druid-exact-cardinality/src/main/java/org/apache/druid/query/aggregation/exact/cardinality/bitmap64/sql/Bitmap64ExactCardinalitySqlAggregator.java:
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.exact.cardinality.bitmap64.sql;
+
+import org.apache.calcite.rel.core.AggregateCall;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlAggFunction;
+import org.apache.calcite.sql.SqlFunctionCategory;
+import org.apache.calcite.sql.type.InferTypes;
+import org.apache.calcite.sql.type.SqlTypeFamily;
+import org.apache.calcite.sql.type.SqlTypeName;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.query.aggregation.AggregatorFactory;
+import 
org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityBuildAggregatorFactory;
+import 
org.apache.druid.query.aggregation.exact.cardinality.bitmap64.Bitmap64ExactCardinalityMergeAggregatorFactory;
+import 
org.apache.druid.query.aggregation.post.FinalizingFieldAccessPostAggregator;
+import org.apache.druid.query.dimension.DefaultDimensionSpec;
+import org.apache.druid.query.dimension.DimensionSpec;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.column.RowSignature;
+import org.apache.druid.segment.column.ValueType;
+import org.apache.druid.sql.calcite.aggregation.Aggregation;
+import org.apache.druid.sql.calcite.aggregation.SqlAggregator;
+import org.apache.druid.sql.calcite.expression.DruidExpression;
+import org.apache.druid.sql.calcite.expression.Expressions;
+import org.apache.druid.sql.calcite.expression.OperatorConversions;
+import org.apache.druid.sql.calcite.planner.Calcites;
+import org.apache.druid.sql.calcite.planner.PlannerContext;
+import org.apache.druid.sql.calcite.rel.VirtualColumnRegistry;
+
+import javax.annotation.Nullable;
+import java.util.Collections;
+import java.util.List;
+
+public class Bitmap64ExactCardinalitySqlAggregator implements SqlAggregator
+{
+
+  private static final String NAME = "BITMAP64_EXACT_CARDINALITY";
+  private static final SqlAggFunction FUNCTION_INSTANCE = 
OperatorConversions.aggregatorBuilder(NAME)
+                                                                             
.operandNames("column")
+                                                                             
.operandTypes(SqlTypeFamily.ANY)

Review Comment:
   the operandType is declared as ANY which means it can be used with ANY data 
type. however, since only LONG is supported, it's better to limit the input to 
type of LONG



##########
extensions-contrib/druid-exact-cardinality/src/main/java/org/apache/druid/query/aggregation/exact/cardinality/bitmap64/Bitmap64ExactCardinalityObjectStrategy.java:
##########
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.exact.cardinality.bitmap64;
+
+import org.apache.druid.segment.data.ObjectStrategy;
+
+import javax.annotation.Nullable;
+import java.nio.BufferUnderflowException;
+import java.nio.ByteBuffer;
+
+public class Bitmap64ExactCardinalityObjectStrategy implements 
ObjectStrategy<Bitmap64Counter>
+{
+
+  static final Bitmap64ExactCardinalityObjectStrategy STRATEGY = new 
Bitmap64ExactCardinalityObjectStrategy();
+
+  @Override
+  public Class<? extends Bitmap64Counter> getClazz()
+  {
+    return RoaringBitmap64Counter.class;
+  }
+
+  @Nullable
+  @Override
+  public Bitmap64Counter fromByteBuffer(ByteBuffer buffer, int numBytes)
+  {
+    final ByteBuffer readOnlyBuffer = buffer.asReadOnlyBuffer();
+
+    if (readOnlyBuffer.remaining() < numBytes) {
+      throw new BufferUnderflowException();
+    }
+
+    byte[] bytes = new byte[numBytes];
+    readOnlyBuffer.get(bytes, 0, numBytes);
+    return RoaringBitmap64Counter.fromBytes(bytes);

Review Comment:
   can we eliminate the byte array copy by passing a customized InputStream to 
RoaringBitmap64Counter?



##########
extensions-contrib/druid-exact-cardinality/pom.xml:
##########
@@ -0,0 +1,164 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; 
xmlns="http://maven.apache.org/POM/4.0.0";
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/maven-v4_0_0.xsd";>
+  <modelVersion>4.0.0</modelVersion>
+
+  <groupId>org.apache.druid.extensions.contrib</groupId>
+  <artifactId>druid-exact-cardinality</artifactId>
+  <name>druid-exact-cardinality</name>
+  <description>Druid Aggregators for exact cardinality</description>
+
+  <parent>
+    <groupId>org.apache.druid</groupId>
+    <artifactId>druid</artifactId>
+    <version>34.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.calcite</groupId>
+      <artifactId>calcite-core</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.druid</groupId>
+      <artifactId>druid-processing</artifactId>
+      <version>${project.parent.version}</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.druid</groupId>
+      <artifactId>druid-sql</artifactId>
+      <version>${project.parent.version}</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.google.code.findbugs</groupId>
+      <artifactId>jsr305</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.google.guava</groupId>
+      <artifactId>guava</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.google.inject</groupId>
+      <artifactId>guice</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.roaringbitmap</groupId>
+      <artifactId>RoaringBitmap</artifactId>
+      <version>0.9.49</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>it.unimi.dsi</groupId>
+      <artifactId>fastutil-core</artifactId>
+      <version>8.5.4</version>
+      <scope>provided</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-annotations</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-core</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-databind</artifactId>
+      <scope>provided</scope>
+    </dependency>
+
+    <!-- Test Dependencies -->
+    <dependency>
+      <groupId>junit</groupId>
+      <artifactId>junit</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.junit.jupiter</groupId>
+      <artifactId>junit-jupiter-api</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.easymock</groupId>
+      <artifactId>easymock</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.druid</groupId>
+      <artifactId>druid-processing</artifactId>
+      <version>${project.parent.version}</version>
+      <type>test-jar</type>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.druid</groupId>
+      <artifactId>druid-server</artifactId>
+      <version>${project.parent.version}</version>
+      <type>test-jar</type>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.druid</groupId>
+      <artifactId>druid-sql</artifactId>
+      <version>${project.parent.version}</version>
+      <type>test-jar</type>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.hamcrest</groupId>
+      <artifactId>hamcrest-core</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.reflections</groupId>
+      <artifactId>reflections</artifactId>
+      <scope>test</scope>
+    </dependency>
+  </dependencies>
+
+  <build>
+    <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-jar-plugin</artifactId>

Review Comment:
   what is this used for



##########
docs/configuration/extensions.md:
##########
@@ -86,6 +86,7 @@ All of these community extensions can be downloaded using 
[pull-deps](../operati
 |druid-ddsketch|Support for DDSketch approximate quantiles based on 
[DDSketch](https://github.com/datadog/sketches-java) | 
[link](../development/extensions-contrib/ddsketch-quantiles.md)|
 |druid-deltalake-extensions|Support for ingesting Delta Lake 
tables.|[link](../development/extensions-contrib/delta-lake.md)|
 |druid-distinctcount|DistinctCount 
aggregator|[link](../development/extensions-contrib/distinctcount.md)|
+|druid-exact-cardinality|Support for exact cardinality counting using Roaring 
Bitmaps.|[link](../development/extensions-contrib/druid-exact-cardinality.md)|

Review Comment:
   ```suggestion
   |druid-exact-cardinality|Support for exact cardinality counting using 
Roaring Bitmap over a Long 
column.|[link](../development/extensions-contrib/druid-exact-cardinality.md)|
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to