cboumalh commented on code in PR #51298: URL: https://github.com/apache/spark/pull/51298#discussion_r2220589709
##########
sql/api/src/main/scala/org/apache/spark/sql/functions.scala:
##########
@@ -1149,6 +1149,169 @@ object functions {
    */
   def sum_distinct(e: Column): Column = Column.fn("sum", isDistinct = true, e)
 
+  /**
+   * Aggregate function: returns the compact binary representation of the Datasketches
+   * ThetaSketch, generated by intersecting previously created Datasketches ThetaSketch instances
+   * via a Datasketches Intersection instance. Allows setting of log nominal entries for the
+   * intersection buffer.
+   *
+   * @group agg_funcs
+   * @since 4.0.0
+   */
+  def theta_intersection_agg(e: Column, lgNomEntries: Column): Column =

Review Comment:
   Q1:
   https://github.com/apache/spark/blob/b3cb40a3524023618b932c7b744bb412e7d62bc9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/thetasketchesExpressions.scala#L90
   The line above is in ThetaUnion; the same applies to ThetaIntersection and ThetaDifference.

   https://github.com/apache/spark/blob/b3cb40a3524023618b932c7b744bb412e7d62bc9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala#L88
   The line above is in ThetaSketchAgg; the same applies to ThetaUnionAgg and ThetaIntersectionAgg.

   Q2:
   Yes, unfortunately. As currently designed, the value has to be specified in each call. Maybe it could be set when the Spark session is created (not sure how feasible that is), or users can define a wrapper that sets the nominal entries variable. The lgNominalEntries variable is only accessible in the UpdateSketch class, which is why I had to go with this approach. In other words, the way theta sketches are designed makes it impossible to keep track of this variable without overcomplicating the codebase or creating significant inefficiencies.
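
   As a rough illustration of the wrapper idea in Q2, here is a minimal sketch. It assumes the `(Column, Column)` overload from this PR is available. The names `ThetaDefaults`, `DefaultLgNomEntries`, and `withLgNomEntries` are hypothetical, and the value 12 is only illustrative, not something Spark defines:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit

// Hypothetical user-side helper; not part of Spark or of this PR.
object ThetaDefaults {
  // Illustrative project-wide default chosen by the user.
  val DefaultLgNomEntries: Int = 12

  // Fixes the lgNomEntries argument of any two-argument aggregate
  // function and returns a one-argument function over the sketch column.
  def withLgNomEntries(
      agg: (Column, Column) => Column,
      lgNomEntries: Int = DefaultLgNomEntries): Column => Column =
    e => agg(e, lit(lgNomEntries))
}
```

   With that in place, something like `val intersectAgg = ThetaDefaults.withLgNomEntries(theta_intersection_agg(_, _))` followed by `df.agg(intersectAgg(col("sketch")))` should keep the value in one place, although every call still passes it explicitly under the hood.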