mkaravel commented on PR #40615:
URL: https://github.com/apache/spark/pull/40615#issuecomment-1505627742

   > Hi @mkaravel thank you for the review! I'll respond to your comments 
in-line.
   > 
   > > more in favor of an aggregate function that merges sketches and returns 
the merged sketch
   > 
   > I'm not opposed to building out an agg function that merges sketches 
without estimating the cardinality; I think this would be beneficial for 
multi-stage re-aggregations. I think this topic (unfortunately) begs the 
question of function naming, and here's the resultant naming scheme I'd propose:
   > 
   > Aggregate functions:
   > 
   > * HllSketch (IntegerType|LongType|StringType|...) -> BinaryType
   > * HllUnion (BinaryType) -> BinaryType
   > 
   > Normal functions
   > 
   > * HllSketchEstimate (BinaryType) -> LongType
   > 
   > I think this API is simple and the lack of an aggregate function that 
returns the estimated cardinality is fine. What do you think?
   
   Semantically I like this API. There is one more scalar expression that I 
would add, it's purpose being to merge two sketches. Let me give some more 
details below.
   
   In terms of naming I am more in favor of dropping the "Sketch" part from the 
names and using an underscore to separate the "Hll" part from the rest. More 
specifically this is what I would choose (this is a personal opinion, and I am 
not reflecting opinions of anybody else other than myself):
   
   Aggregate functions:
   
   * hll_collect (IntegerType|LongType|StringType|BinaryType|...) -> BinaryType
   * {hll_merge_agg, hll_union_agg}(BinaryType) -> BinaryType
   
   Scalar functions:
   * {hll_merge, hll_union}(BinaryType, BinaryType) -> BinaryType
   * hll_estimate(BinaryType) -> LongType
   
   Above I propose hll_merge* or hll_union* for the functions that perform the 
union (or merge operation) on sketches. I have a slight preference for the 
"merge" versions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to