Re: [PR] [SPARK-52588][SQL] Approx_top_k: accumulate and estimate [spark]

via GitHub Fri, 11 Jul 2025 14:12:42 -0700


yhuang-db commented on code in PR #51308:
URL: https://github.com/apache/spark/pull/51308#discussion_r2201876523



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxTopKAggregates.scala:
##########
@@ -305,4 +313,123 @@ object ApproxTopK {
         new ArrayOfDecimalsSerDe(dt).asInstanceOf[ArrayOfItemsSerDe[Any]]
     }
   }
+
+  def getSketchStateDataType(itemDataType: DataType): StructType =
+    StructType(
+      StructField("Sketch", BinaryType, nullable = false) ::
+        StructField("ItemTypeNull", itemDataType) ::
+        StructField("MaxItemsTracked", IntegerType, nullable = false) :: Nil)
+}
+
+/**
+ * An aggregate function that accumulates items into a sketch, which can then be used
+ * to combine with other sketches, via ApproxTopKCombine,
+ * or to estimate the top K items, via ApproxTopKEstimate.
+ *
+ * The output of this function is a struct containing the sketch in binary format,
+ * a null object indicating the type of items in the sketch,
+ * and the maximum number of items tracked by the sketch.
+ *
+ * @param expr            the child expression to accumulate items from
+ * @param maxItemsTracked the maximum number of items to track in the sketch

Review Comment:
   Added in https://github.com/apache/spark/pull/51308/commits/801929cec0dd0b15b3be44506a1f3237a4f1364d



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
