Re: [PR] [SPARK-54785][SQL] Add support for binary sketch aggregations in KLL [spark]

via GitHub Thu, 08 Jan 2026 10:39:46 -0800


cboumalh commented on code in PR #53548:
URL: https://github.com/apache/spark/pull/53548#discussion_r2673115259



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/kllAggregates.scala:
##########
@@ -449,6 +449,414 @@ case class KllSketchAggDouble(
   }
 }
 
+/**
+ * The KllMergeAggBigint function merges multiple Apache DataSketches 
KllLongsSketch instances
+ * that have been serialized to binary format. This is useful for combining 
sketches created
+ * in separate aggregations (e.g., from different partitions or time windows).
+ * It outputs the merged binary representation of the KllLongsSketch.
+ *
+ * See [[https://datasketches.apache.org/docs/KLL/KLLSketch.html]] for more 
information.
+ *
+ * @param child
+ *   child expression containing binary KllLongsSketch representations to merge
+ * @param kExpr
+ *   optional expression for the k parameter from the Apache DataSketches 
library that controls

Review Comment:
   Good point, so I looked into the KLL library a bit further and from what I 
understood, in the aggregate version, the value of K creates two scenarios: 1) 
it's smaller than the sketches' K values that are about to be aggregated and we 
lose accuracy 2) bigger than the sketches' K values that are about to be 
aggregated which creates more space per level in the KLL sketch (and the sketch 
as a whole) which improves accuracy (improves here means less compaction 
happens). So, using this responsibly provides value to the user. **Note**: 
Since we can access the K value when we heapify or wrap a KLL sketch we can do 
something similar to HLL where we defer the creation of the aggregation buffer 
until we encounter the first sketch in aggregation and use its K value (see 
[here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala#L306-L404)).
 HLL has the boolean as a second argument which allows
  users to raise an error if the sketches being aggregated have different 
sizes, but I'm not sure this is necessary if we can assume users know what they 
are doing? I think the best of both worlds would be to use the K value if it is 
passed in, else use the K from the first sketch for the aggregation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-54785][SQL] Add support for binary sketch aggregations in KLL [spark]

Reply via email to