Rebootor opened a new issue, #9202: URL: https://github.com/apache/incubator-gluten/issues/9202
### Description

Gluten currently implements the `approx_count_distinct` function, leveraging HyperLogLog (HLL) or similar approximate cardinality estimation algorithms. However, the underlying binary sketch representation generated by these algorithms is not exposed to the user.

**Problem:**

This limitation prevents users from:

1. Persistently storing the sketches: The inability to serialize and store the binary sketch hinders offline analysis and long-term data aggregation.
2. Merging sketches: Merging sketches from different datasets or partitions is essential for accurate cardinality estimation across larger datasets. Without access to the binary representation, this operation is not feasible.
3. Performing custom analysis: Users requiring advanced cardinality analysis or integration with external systems are restricted by the lack of direct access to the sketch.

**Proposed Solution:**

Expose the binary sketch representation as a `BINARY` or `BYTE_ARRAY` type. This would allow users to:

1. Retrieve the binary sketch
2. Store the binary sketch
3. Merge binary sketches
4. Estimate cardinality from the merged sketch
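
To make the four requested operations concrete, here is a minimal, self-contained Python sketch of a toy HyperLogLog. It is only an illustration of the retrieve / store / merge / estimate workflow, not Gluten's, Velox's, or Spark's implementation or binary format. The class name `HllSketch`, its methods, the one-byte precision header, and the SHA-1 based hash are all assumptions made for this example.

```python
import hashlib
import math


class HllSketch:
    """Toy HyperLogLog sketch (hypothetical; not any engine's real format)."""

    def __init__(self, p=12, registers=None):
        self.p = p              # number of hash bits used as register index
        self.m = 1 << p         # number of registers
        self.registers = (
            bytearray(registers) if registers is not None else bytearray(self.m)
        )

    def add(self, value):
        # Derive a 64-bit hash from the value's string form.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the first 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def to_bytes(self):
        # "Retrieve" / "store": one header byte for p, then the raw registers.
        return bytes([self.p]) + bytes(self.registers)

    @classmethod
    def from_bytes(cls, data):
        return cls(p=data[0], registers=data[1:])

    def merge(self, other):
        # "Merge": register-wise maximum of two sketches with equal precision.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = bytes(max(a, b) for a, b in zip(self.registers, other.registers))
        return HllSketch(self.p, merged)

    def estimate(self):
        # "Estimate": standard HLL estimator with the small-range correction.
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # constant valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)         # linear counting
        return raw


# Usage: build per-partition sketches, persist them as bytes, merge later.
part_a, part_b = HllSketch(), HllSketch()
for v in range(0, 6000):
    part_a.add(v)
for v in range(4000, 10000):
    part_b.add(v)

stored_a, stored_b = part_a.to_bytes(), part_b.to_bytes()   # e.g. BINARY columns
merged = HllSketch.from_bytes(stored_a).merge(HllSketch.from_bytes(stored_b))
approx_distinct = merged.estimate()
```

Because merging is a register-wise maximum, the merged sketch is identical to one built over the union of both inputs, which is why exposing the binary form would be enough to support incremental and cross-partition aggregation.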
