dtenedor commented on PR #39678: URL: https://github.com/apache/spark/pull/39678#issuecomment-1399101487
Hi @RyanBerti, after a few initial conversations about this proposal, we wanted to express some questions and opinions here for your consideration. In general, we wholeheartedly appreciate your time, energy, and attention in working on this collection of features in the interest of expanding the functionality and usefulness of Apache Spark (as well as increasing consistency with other engines). * We feel strongly that the new function names should not begin with "approx_count_distinct" in order to differentiate the new sketch-based approach against the existing approximate distinct count implementation. In particular, the sketch-based approach exposes intermediate buffers suitable for saving into storage and then retrieving and merging later, whereas the existing "approx_count_distinct" aggregate function consumes input values and then returns the result immediately (without exposing any intermediate sketch to the user). This is essentially a function naming detail, and we can consider a prefix like "HLL_" for the new functions in order to accomplish this objective. This change should not consume too much time to bring into effect. * After examining the existing helper library in the existing [HyperLogLogPlusPLusHelper utility](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala), and comparing to other formats for generating and merging sketches for similar purposes, we would like to express a strong preference for using the Apache DataSketches library instead (https://datasketches.apache.org/). This is a dedicated Apache project with more than a decade of investment behind it, supporting very high performance for multiple types of sketches (such as KLL sketches for computing approximate quantiles, for example) and with client library support in C++, Java, and Python. We can use the Java implementation to integrate with Apache Spark using Scala/Java integration. This also aligns with the principles of Apache Spark as an open source project Best, Daniel @czarifis @mkaravel @gatorsmile @entong -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org