mkaravel commented on PR #40615:
URL: https://github.com/apache/spark/pull/40615#issuecomment-1505627742
> Hi @mkaravel thank you for the review! I'll respond to your comments
in-line.
>
> > more in favor of an aggregate function that merges sketches and returns
the merged sketch
>
> I'm not opposed to building out an agg function that merges sketches
without estimating the cardinality; I think this would be beneficial for
multi-stage re-aggregations. I think this topic (unfortunately) begs the
question of function naming, and here's the resultant naming scheme I'd propose:
>
> Aggregate functions:
>
> * HllSketch (IntegerType|LongType|StringType|...) -> BinaryType
> * HllUnion (BinaryType) -> BinaryType
>
> Normal functions
>
> * HllSketchEstimate (BinaryType) -> LongType
>
> I think this API is simple and the lack of an aggregate function that
returns the estimated cardinality is fine. What do you think?
Semantically I like this API. There is one more scalar expression that I
would add, it's purpose being to merge two sketches. Let me give some more
details below.
In terms of naming I am more in favor of dropping the "Sketch" part from the
names and using an underscore to separate the "Hll" part from the rest. More
specifically this is what I would choose (this is a personal opinion, and I am
not reflecting opinions of anybody else other than myself):
Aggregate functions:
* hll_collect (IntegerType|LongType|StringType|BinaryType|...) -> BinaryType
* {hll_merge_agg, hll_union_agg}(BinaryType) -> BinaryType
Scalar functions:
* {hll_merge, hll_union}(BinaryType, BinaryType) -> BinaryType
* hll_estimate(BinaryType) -> LongType
Above I propose hll_merge* or hll_union* for the functions that perform the
union (or merge operation) on sketches. I have a slight preference for the
"merge" versions.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]