cboumalh commented on code in PR #53548: URL: https://github.com/apache/spark/pull/53548#discussion_r2673115259
########## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/kllAggregates.scala: ########## @@ -449,6 +449,414 @@ case class KllSketchAggDouble( } } +/** + * The KllMergeAggBigint function merges multiple Apache DataSketches KllLongsSketch instances + * that have been serialized to binary format. This is useful for combining sketches created + * in separate aggregations (e.g., from different partitions or time windows). + * It outputs the merged binary representation of the KllLongsSketch. + * + * See [[https://datasketches.apache.org/docs/KLL/KLLSketch.html]] for more information. + * + * @param child + * child expression containing binary KllLongsSketch representations to merge + * @param kExpr + * optional expression for the k parameter from the Apache DataSketches library that controls Review Comment: Good point, so I looked into the KLL library a bit further and from what I understood, in the aggregate version, the value of K creates two scenarios: 1) it's smaller than the sketches' K values that are about to be aggregated and we lose accuracy 2) bigger than the sketches' K values that are about to be aggregated which creates more space per level in the KLL sketch (and the sketch as a whole) which improves accuracy (improves here means less compaction happens). So, using this responsibly provides value to the user. **Note**: Since we can access the K value when we heapify or wrap a KLL sketch we can do something similar to HLL where we defer the creation of the aggregation buffer until we encounter the first sketch in aggregation and use its K value (see [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala#L306-L404)). HLL has the boolean as a second argument which allows users to raise an error if the sketches being aggregated have different sizes, but I'm not sure this is necessary if we can assume users know what they are doing? I think the best of both worlds would be to use the K value if it is passed in, else use the K from the first sketch for the aggregation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
