aggregator

via GitHub Fri, 20 Jan 2023 16:34:57 -0800


dtenedor commented on PR #39678:
URL: https://github.com/apache/spark/pull/39678#issuecomment-1399101487


   Hi @RyanBerti, after a few initial conversations about this proposal, we 
wanted to express some questions and opinions here for your consideration. In 
general, we wholeheartedly appreciate your time, energy, and attention in 
working on this collection of features in the interest of expanding the 
functionality and usefulness of Apache Spark (as well as increasing consistency 
with other engines).
   
   * We feel strongly that the new function names should not begin with 
"approx_count_distinct" in order to differentiate the new sketch-based approach 
against the existing approximate distinct count implementation. In particular, 
the sketch-based approach exposes intermediate buffers suitable for saving into 
storage and then retrieving and merging later, whereas the existing 
"approx_count_distinct" aggregate function consumes input values and then 
returns the result immediately (without exposing any intermediate sketch to the 
user). This is essentially a function naming detail, and we can consider a 
prefix like "HLL_" for the new functions in order to accomplish this objective. 
This change should not consume too much time to bring into effect.
   
   * After examining the existing helper library in the existing 
[HyperLogLogPlusPLusHelper 
utility](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala),
 and comparing to other formats for generating and merging sketches for similar 
purposes, we would like to express a strong preference for using the Apache 
DataSketches library instead (https://datasketches.apache.org/). This is a 
dedicated Apache project with more than a decade of investment behind it, 
supporting very high performance for multiple types of sketches (such as KLL 
sketches for computing approximate quantiles, for example) and with client 
library support in C++, Java, and Python. We can use the Java implementation to 
integrate with Apache Spark using Scala/Java integration. This also aligns with 
the principles of Apache Spark as an open source project 
   
   Best,
   Daniel
   
   @czarifis
   @mkaravel
   @gatorsmile 
   @entong


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #39678: [SPARK-16484][SQL] Add HyperLogLogPlusPlus sketch generator/evaluator/aggregator

Reply via email to