zmin1217 opened a new issue, #64:
URL: https://github.com/apache/datasketches-hive/issues/64
On the TPC-H dataset, I use theta sketches to estimate intersection sizes. The
error of some results reaches 41%, but the documentation says the default size
(4096) gives about 3% error.
spark.sql("create temporary function data2sketch as 'org.apache.datasketches.hive.theta.DataToSketchUDAF'")
spark.sql("create temporary function intersect as 'org.apache.datasketches.hive.theta.IntersectSketchUDF'")
spark.sql("create temporary function estimate as 'org.apache.datasketches.hive.theta.EstimateSketchUDF'")
scala> lineitem.select("l_suppkey").intersect(order.select("o_orderkey")).count
res17: Long = 250000
but the theta sketch result is 145593, so the error is 0.41
scala> customer.select("c_custkey").intersect(lineitem.select("l_orderkey")).count
res18: Long = 3750000
but the theta sketch result is 4404198, so the error relative to the exact count is about 0.17 (654198 / 3750000)
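
For reference, the error figures can be recomputed from the exact and estimated counts given above. This is only a minimal sketch, assuming the error is measured relative to the exact count; the helper name relativeError is hypothetical and not part of datasketches-hive.

```scala
// Relative error of a sketch estimate, measured against the exact count.
// Assumption: error = |estimate - exact| / exact.
def relativeError(exact: Long, estimate: Long): Double =
  math.abs(estimate - exact).toDouble / exact

// Counts reported in this issue.
val err1 = relativeError(250000L, 145593L)   // l_suppkey vs o_orderkey
val err2 = relativeError(3750000L, 4404198L) // c_custkey vs l_orderkey

println(s"first intersection:  $err1")
println(s"second intersection: $err2")
```

With these numbers the first error comes out at roughly 0.418 and the second at roughly 0.174.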
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]