cboumalh commented on code in PR #53548:
URL: https://github.com/apache/spark/pull/53548#discussion_r2673115259


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/kllAggregates.scala:
##########
@@ -449,6 +449,414 @@ case class KllSketchAggDouble(
   }
 }
 
+/**
+ * The KllMergeAggBigint function merges multiple Apache DataSketches KllLongsSketch instances
+ * that have been serialized to binary format. This is useful for combining sketches created
+ * in separate aggregations (e.g., from different partitions or time windows).
+ * It outputs the merged binary representation of the KllLongsSketch.
+ *
+ * See [[https://datasketches.apache.org/docs/KLL/KLLSketch.html]] for more information.
+ *
+ * @param child
+ *   child expression containing binary KllLongsSketch representations to merge
+ * @param kExpr
+ *   optional expression for the k parameter from the Apache DataSketches library that controls
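
For context, a rough standalone sketch of the kind of merge this function performs, using the DataSketches Java API directly (assuming `KllLongsSketch.newHeapInstance`, `heapify`, `merge`, `toByteArray` and `Memory.wrap`; the k of 200 and the input values are purely illustrative):

```scala
import org.apache.datasketches.kll.KllLongsSketch
import org.apache.datasketches.memory.Memory

// Two sketches built independently (e.g., on different partitions or time windows),
// then serialized to the binary form the aggregate consumes.
val s1 = KllLongsSketch.newHeapInstance(200)
(1L to 1000L).foreach(v => s1.update(v))
val s2 = KllLongsSketch.newHeapInstance(200)
(1001L to 2000L).foreach(v => s2.update(v))
val binaries: Seq[Array[Byte]] = Seq(s1.toByteArray, s2.toByteArray)

// Merge: heapify each serialized sketch, fold it into an accumulator,
// and re-serialize the merged result.
val merged = KllLongsSketch.newHeapInstance(200)
binaries.foreach(bytes => merged.merge(KllLongsSketch.heapify(Memory.wrap(bytes))))
val mergedBinary: Array[Byte] = merged.toByteArray
```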

Review Comment:
   Good point. I looked into the KLL library a bit further, and from what I understood, the K value of the merge aggregate creates two scenarios: 1) it is smaller than the K values of the sketches being merged, and we lose accuracy; 2) it is larger than the K values of the sketches being merged, which gives each level of the KLL sketch (and the sketch as a whole) more space and improves accuracy. Used responsibly, this does provide value to the user.

   **Note**: Since we can read the K value when we heapify or wrap a KLL sketch, we could do something similar to HLL and defer creating the aggregation buffer until we encounter the first sketch in the aggregation, then use its K value (see [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala#L306-L404)). HLL takes a boolean as a second argument that lets users raise an error if the sketches being aggregated have different sizes, but I'm not sure that is necessary if we can assume users know what they are doing? I think the best of both worlds would be to use the K value if it is passed in, and otherwise use the K from the first sketch for the aggregation.
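
   Roughly what I have in mind, as a standalone sketch (the class and method names here are hypothetical, not the PR's actual buffer code, and I'm assuming the DataSketches Java API calls `heapify`, `getK`, `merge`, `newHeapInstance` and `Memory.wrap`):

   ```scala
   import org.apache.datasketches.kll.KllLongsSketch
   import org.apache.datasketches.memory.Memory

   // Hypothetical merge-buffer holder; illustrative only.
   class LazyKllMergeBuffer(explicitK: Option[Int]) {
     private var acc: KllLongsSketch = _

     def add(serialized: Array[Byte]): Unit = {
       val input = KllLongsSketch.heapify(Memory.wrap(serialized))
       if (acc == null) {
         // Defer buffer creation: prefer the user-supplied K, otherwise
         // inherit K from the first sketch seen in this aggregation.
         acc = KllLongsSketch.newHeapInstance(explicitK.getOrElse(input.getK))
       }
       acc.merge(input)
     }

     def serialize(): Array[Byte] =
       if (acc == null) Array.emptyByteArray else acc.toByteArray
   }
   ```

   With something like this, passing K stays meaningful for users who want to control the size/accuracy trade-off, while the common case needs no K argument at all.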



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


