[GitHub] [spark] mkaravel commented on a diff in pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

via GitHub Fri, 21 Apr 2023 00:33:28 -0700


mkaravel commented on code in PR #40615:
URL: https://github.com/apache/spark/pull/40615#discussion_r1173416750



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/DatasketchesHllSketchSuite.scala:
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import scala.collection.immutable.NumericRange
+import scala.util.Random
+
+import org.apache.datasketches.hll.HllSketch
+import org.apache.datasketches.memory.Memory
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{BoundReference, 
HllSketchEstimate}
+import org.apache.spark.sql.types.{BinaryType, DataType, IntegerType, 
LongType, StringType}
+import org.apache.spark.unsafe.types.UTF8String
+
+
+class DatasketchesHllSketchSuite extends SparkFunSuite {

Review Comment:
   I think it adds significant cognitive overhead and does not offer much (in 
my opinion): this is about the internal representation of the sketch and not 
does not affect precision. The eventual goal when computing sketches is 
computing NDV estimates. How I get to these estimates matters less I believe, 
as long as I maintain the precision that I want.
   
   Also if we are to add it to the `hll_sketch_agg` why not also add it to 
`hll_union_agg` and `hll_union` for symmetry? And this is where I feel the 
entire API becomes too complicated (see also the discussion in another comment 
about mixing `lgk` values).
   
   In general my approach on sketches as an imaginary user is the following:
   
   Context:
   * _I have data that I want to compute the NDVs for. Exact NDV computation is 
costly and not necessary for my application, so I am okay if I have a way to 
compute approximate NDVs if they are faster._
   * _This is not an one-off operation for me: I want to be able to somehow 
update my approximate NDVs as more data gets it. I also want to combine NDVs 
from different sources, or across sources and group them together according to 
my application needs._
   
   _I bump onto the Spark documentation and see that Spark offers this amazing 
tool called HLL sketches. What is really a sketch? It's a binary value that 
internally encodes the number of NDVs in my data with some precision. I also 
see that Spark allows me to combine two or more sketches into one, and that 
from a sketch I can get an NDV estimate by calling a function. Can I control 
the precision? Yes, but for more precision I need more space, that's the 
trade-off I have to live with. Do I need to know the internal encoding to use 
this in my application? Not really. As long as Spark gives me a good API to 
manage my sketches (which it does: I can create sketches, I can merge them, I 
can get the NDV estimate any time I want) I do not really care how the sketch 
is encoded. I do not want it to be slow for sure, because I may need to do this 
for massive amounts of data._
   
   The only users that would want to know the internal encoding and the number 
of bits per register are those that might want to balance between the size of 
the sketch and it's runtime efficiency. We are talking about really fine tuning 
here, where one tries to balance between space requirements for the sketches in 
terms of storage, precision, runtime performance, and stress on the system 
while executing queries. I am not expecting the power users that I am 
describing here to be many. And if they exist, we can always add overloads that 
would allow the manipulation of the register size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] mkaravel commented on a diff in pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

Reply via email to