cboumalh commented on code in PR #51298:
URL: https://github.com/apache/spark/pull/51298#discussion_r2220589709


##########
sql/api/src/main/scala/org/apache/spark/sql/functions.scala:
##########
@@ -1149,6 +1149,169 @@ object functions {
    */
   def sum_distinct(e: Column): Column = Column.fn("sum", isDistinct = true, e)
 
+  /**
+   * Aggregate function: returns the compact binary representation of the Datasketches
+   * ThetaSketch, generated by intersecting previously created Datasketches ThetaSketch instances
+   * via a Datasketches Intersection instance. Allows setting of log nominal entries for the
+   * intersection buffer.
+   *
+   * @group agg_funcs
+   * @since 4.0.0
+   */
+  def theta_intersection_agg(e: Column, lgNomEntries: Column): Column =

Review Comment:
   Q1:
   
https://github.com/apache/spark/blob/b3cb40a3524023618b932c7b744bb412e7d62bc9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/thetasketchesExpressions.scala#L90
   
   The line above is in ThetaUnion; the same applies to ThetaIntersection and ThetaDifference.
   
   
https://github.com/apache/spark/blob/b3cb40a3524023618b932c7b744bb412e7d62bc9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala#L88
   
   The line above is in ThetaSketchAgg; the same applies to ThetaUnionAgg and ThetaIntersectionAgg.
   
   Q2:
   Yes. Unfortunately, with the current design the value has to be specified in each call. It might be possible to set it when the Spark session is defined (I am not sure how feasible that is), or users can define a wrapper that sets the nominal entries value. lgNominalEntries is only accessible in the UpdateSketch class, which is why I went with this approach. In other words, the way theta sketches are designed, there is no way to keep track of this value without overcomplicating the codebase or introducing significant inefficiencies.
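
   To illustrate the wrapper option mentioned in Q2, here is a minimal sketch. It assumes the `theta_intersection_agg(e, lgNomEntries)` signature from the diff above. `Column`, `lit`, and `theta_intersection_agg` are replaced by small stand-ins so the snippet runs without Spark on the classpath. `ThetaDefaults`, `thetaIntersectionAgg`, and the value 12 are hypothetical names and choices, not part of this PR.

```scala
// Stand-ins so the sketch runs without Spark. In real code, Column and lit
// come from org.apache.spark.sql, and theta_intersection_agg is the function
// proposed in this PR (signature taken from the diff above).
final case class Column(expr: String)

object StandIns {
  def lit(value: Any): Column = Column(value.toString)

  def theta_intersection_agg(e: Column, lgNomEntries: Column): Column =
    Column(s"theta_intersection_agg(${e.expr}, ${lgNomEntries.expr})")
}

// Hypothetical wrapper: fixes lgNomEntries in one place so that call sites
// do not have to repeat it.
object ThetaDefaults {
  import StandIns._

  // Assumed project-wide value; 12 is only an example (2^12 = 4096 entries).
  val LgNomEntries: Int = 12

  def thetaIntersectionAgg(e: Column): Column =
    theta_intersection_agg(e, lit(LgNomEntries))
}
```

   Call sites would then use `ThetaDefaults.thetaIntersectionAgg(col)` and pick up the shared value. The same pattern would apply to the union and sketch aggregates if they take the same parameter.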



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


