[GitHub] [spark] mkaravel commented on pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

via GitHub Fri, 21 Apr 2023 00:00:57 -0700


mkaravel commented on PR #40615:
URL: https://github.com/apache/spark/pull/40615#issuecomment-1517360699


   > @mkaravel regarding your comment about 'mixing sketches with different lgk 
values',[ this is the Union implementation which handles merging sketches with 
different 
configs](https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/Union.java#L317-L340);
 my assumption is that the functionality is tested and stable, but let me know 
if you think we should try to limit the union operation to only support 
sketches with the same config?
   
   I am not questioning the correctness of the DataSketches implementation.
   
   My concern is accidental mistakes that can happen if the user does not pay 
attention to the `lgk` values of the input sketches. I would argue that merging 
two sketches with different `lgk` values is "advanced" usage of sketches, and 
the user should be aware that mixing such sketches comes with caveats (loosing 
precision with respect to the sketch with higher `lgk`). The current API hides 
this complexity and caveats.
   
   Let's consider another alternative (I want your opinion on this): Let's say 
we have two overloads (we can extend this to the aggregate version)
   * hll_merge(sketch1: BinaryType, sketch2: BinaryType)
   * hll_merge(sketch1: BinaryType, sketch2: BinaryType, allowDifferentLgKs: 
BooleanType)
   
   The first errors out if the `lgk` values are different.
   The second errors out if the `lgk` values are different and 
`allowDifferentLgKs` is `false`. However, if `allowDifferentLgk` is `true` then 
the second overload behaves as your current implementation.
   
   Clearly, I am talking about adding a third boolean argument, with the 
default value being `false`. With these two overloads, if the user tries to 
merge two sketches with different precision, the query will fail. If they 
really need to merge them, they have the opportunity to do that by means of  
setting the third argument to `true`, and it will not happen accidentally 
without them noticing it and being proactive about the loss of precision. A 
nice error message for the "first overload" and good documentation will make it 
very clear what is going on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] mkaravel commented on pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

Reply via email to