Rebootor opened a new issue, #9202:
URL: https://github.com/apache/incubator-gluten/issues/9202

   ### Description
   
   Gluten currently implements the `approx_count_distinct` function using 
HyperLogLog (HLL) or a similar approximate cardinality estimation algorithm. 
However, the binary sketch that the algorithm produces internally is not 
exposed to the user.
   
   **Problem:**
   This limitation prevents users from:
   
   1. Persistently storing the sketches: The inability to serialize and store 
the binary sketch hinders offline analysis and long-term data aggregation.
   2. Merging sketches: Distinct counts from overlapping datasets or partitions 
cannot simply be added together, so an accurate estimate over their union 
requires merging the underlying sketches. Without access to the binary 
representation, this operation is not feasible.
   3. Performing custom analysis: Users requiring advanced cardinality analysis 
or integration with external systems are restricted by the lack of direct 
access to the sketch.
   
   **Proposed Solution:**
   Expose the binary sketch representation as a `BINARY` or `BYTE_ARRAY` type. 
This would allow users to:
   1. Retrieve the binary sketch
   2. Store the binary sketch
   3. Merge binary sketches
   4. Estimate cardinality from the merged sketch
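
   The four steps above can be sketched with a toy HyperLogLog in plain 
Python. This is only an illustration of the requested workflow, not Gluten's 
or Spark's implementation: the function names (`build`, `merge`, `estimate`), 
the precision `P = 12`, the SHA-256 based hash, and the one-byte-per-register 
layout are all assumptions made for this example.

```python
import hashlib
import math

P = 12          # assumed precision: 2**P registers
M = 1 << P      # number of registers, one byte each in this toy layout


def _hash64(value: str) -> int:
    """Derive a 64-bit hash from SHA-256 (an arbitrary choice for the example)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


def build(values) -> bytes:
    """Steps 1 and 2: build a sketch and return it as storable bytes."""
    registers = bytearray(M)
    for value in values:
        h = _hash64(value)
        idx = h >> (64 - P)                       # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1   # 1-based position of first 1-bit
        if rank > registers[idx]:
            registers[idx] = rank
    return bytes(registers)


def merge(a: bytes, b: bytes) -> bytes:
    """Step 3: merge two sketches by taking the register-wise maximum."""
    if len(a) != len(b):
        raise ValueError("sketches must use the same precision")
    return bytes(max(x, y) for x, y in zip(a, b))


def estimate(sketch: bytes) -> float:
    """Step 4: estimate cardinality from a (possibly merged) sketch."""
    m = len(sketch)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


# Two partitions that share 2,000 of their users (10,000 distinct overall).
blob_a = build(f"user-{i}" for i in range(0, 6000))
blob_b = build(f"user-{i}" for i in range(4000, 10000))

merged = merge(blob_a, blob_b)
print(len(merged), round(estimate(merged)))
```

   Because the merge is a register-wise maximum, merging the two stored 
sketches gives exactly the sketch that a single pass over the combined data 
would have produced, which is why exposing the binary form is sufficient for 
cross-partition aggregation.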


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

