thomasrebele commented on issue #700: URL: https://github.com/apache/datasketches-java/issues/700#issuecomment-3670891384
Thanks to everyone for all the background information. I'm not yet sure whether I'll just limit the resolution of the selectivity estimate of the predicate to the resolution of the KLL sketch (e.g., the 1.65% for k=200), or whether I use interpolation. The results don't need to be perfect (those of the KLL sketch aren't either), just good enough for the purpose of selectivity estimation. Thank you for the suggestion, @leerho, I have already used that to implement my own interpolation. I did some experiments with cubic splines interpolation. At first I used some artificial histograms and compared the interpolated vs the non-interpolated average error on a Gaussian distribution. Interpolation seemed to reduce the error noticeably. However, when trying to do the same with the KLL sketch, the two errors were almost the same. The cause seems to be how the histogram bins are chosen: my first experiment had equi-distant bins, while the width of the KLL sketch histogram bins may vary. There might be some ways to improve the interpolation, but as of now I didn't follow that path. I've stumbled upon the t-digest in the tutorial that @leerho mentioned. The t-digest already supports interpolation. It seems to provide much better accuracy for large N, at least in my experiments. If I find a way to work around the accuracy problems for small N, it might be a valuable alternative to the KLL sketch. The accuracy problems are nicely shown in https://datasketches.apache.org/docs/QuantilesStudies/KllSketchVsTDigest.html#t-digest-compression100. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
