mkaravel commented on code in PR #40615: URL: https://github.com/apache/spark/pull/40615#discussion_r1173416750
########## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/DatasketchesHllSketchSuite.scala: ########## @@ -0,0 +1,111 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.aggregate + +import scala.collection.immutable.NumericRange +import scala.util.Random + +import org.apache.datasketches.hll.HllSketch +import org.apache.datasketches.memory.Memory + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{BoundReference, HllSketchEstimate} +import org.apache.spark.sql.types.{BinaryType, DataType, IntegerType, LongType, StringType} +import org.apache.spark.unsafe.types.UTF8String + + +class DatasketchesHllSketchSuite extends SparkFunSuite { Review Comment: I think it adds significant cognitive overhead and does not offer much (in my opinion): this is about the internal representation of the sketch and not does not affect precision. The eventual goal when computing sketches is computing NDV estimates. How I get to these estimates matters less I believe, as long as I maintain the precision that I want. Also if we are to add it to the `hll_sketch_agg` why not also add it to `hll_union_agg` and `hll_union` for symmetry? And this is where I feel the entire API becomes too complicated (see also the discussion in another comment about mixing `lgk` values). In general my approach on sketches as an imaginary user is the following: Context: * _I have data that I want to compute the NDVs for. Exact NDV computation is costly and not necessary for my application, so I am okay if I have a way to compute approximate NDVs if they are faster._ * _This is not an one-off operation for me: I want to be able to somehow update my approximate NDVs as more data gets it. I also want to combine NDVs from different sources, or across sources and group them together according to my application needs._ _I bump onto the Spark documentation and see that Spark offers this amazing tool called HLL sketches. What is really a sketch? It's a binary value that internally encodes the number of NDVs in my data with some precision. I also see that Spark allows me to combine two or more sketches into one, and that from a sketch I can get an NDV estimate by calling a function. Can I control the precision? Yes, but for more precision I need more space, that's the trade-off I have to live with. Do I need to know the internal encoding to use this in my application? Not really. As long as Spark gives me a good API to manage my sketches (which it does: I can create sketches, I can merge them, I can get the NDV estimate any time I want) I do not really care how the sketch is encoded. I do not want it to be slow for sure, because I may need to do this for massive amounts of data._ The only users that would want to know the internal encoding and the number of bits per register are those that might want to balance between the size of the sketch and it's runtime efficiency. We are talking about really fine tuning here, where one tries to balance between space requirements for the sketches in terms of storage, precision, runtime performance, and stress on the system while executing queries. I am not expecting the power users that I am describing here to be many. And if they exist, we can always add overloads that would allow the manipulation of the register size. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
