Jan, WRT t-digest: We did a comparison study of the KLL sketch vs. the t-digest <https://datasketches.apache.org/docs/QuantilesStudies/KllSketchVsTDigest.html>. There is also a more recent and more formal study comparing the ReqSketch with the t-digest <https://arxiv.org/abs/2102.09299>.
Just be aware that the t-digest is an empirical algorithm that cannot provide any *a priori* or *a posteriori* error guarantees because it is sensitive to the input data. The sketches from the DataSketches library are insensitive to the input data and have mathematically proven and predictable error properties (i.e., with well-defined confidence intervals) no matter what the input data looks like. Although the t-digest can perform very well on "well-behaved" data distributions, the problem is that the t-digest sketch cannot provide any warning that the results it is producing may, in fact, be way off. So as a user, you can never really be sure how good its results are. Caveat Emptor!

Cheers,
Lee

On Fri, Mar 19, 2021 at 10:40 AM Jan Prach (Jira) <[email protected]> wrote:

>
>     [ https://issues.apache.org/jira/browse/DATASKETCHES-10?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305075#comment-17305075 ]
>
> Jan Prach commented on DATASKETCHES-10:
> ---------------------------------------
>
> [~veselyp] true! The power of open source :) I was thinking of it. But
> I've opened this ticket to see where DataSketches are heading. And in the
> meanwhile I've hooked up t-digest today.
>
> > Double precision by default?
> > ----------------------------
> >
> >                 Key: DATASKETCHES-10
> >                 URL: https://issues.apache.org/jira/browse/DATASKETCHES-10
> >             Project: Apache Datasketches
> >          Issue Type: Improvement
> >            Reporter: Jan Prach
> >            Priority: Major
> >
> > Would it make sense to use double (instead of float) for all sketches by
> > default?
> > It would take (less than 2x) more memory, have same speed, have twice
> > the storage. Or even the same storage if one is fine with the float
> > precision. Most importantly it would be far more useful.
> > I'm trying to build a generic profiler. In the first simple dataset there
> > were a couple of date and timestamp columns. The obvious choice is to
> > convert them to epoch seconds. Full day of time with weird messages only to
> > realize that KllFloatsSketch, ReqSketch, etc. are all based on floats. That
> > means 24 bit precision. But epoch seconds today are 31 bit numbers.
> > Why not always double?
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
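
For reference, the precision issue described in the quoted ticket can be reproduced with plain Java, without any sketch library. The snippet below is only a minimal sketch: the class name, the helper methods, and the timestamp value are made up for illustration, and the long -> float -> long round trip merely stands in for feeding a value to a float-based sketch.

```java
public class FloatEpochPrecision {
    // Stand-in for storing a value in a float-based sketch: long -> float -> long.
    public static long throughFloat(long v) {
        return (long) (float) v;
    }

    // Same round trip through double, which has a 53-bit significand.
    public static long throughDouble(long v) {
        return (long) (double) v;
    }

    public static void main(String[] args) {
        // Hypothetical timestamp (epoch seconds, March 2021), chosen only for illustration.
        long epochSeconds = 1616175601L;

        System.out.println("original:      " + epochSeconds);
        System.out.println("via float:     " + throughFloat(epochSeconds));
        System.out.println("via double:    " + throughDouble(epochSeconds));
        // Gap between adjacent float values at this magnitude, in seconds.
        System.out.println("float spacing: " + Math.ulp((float) epochSeconds));
    }
}
```

At this magnitude adjacent float values are 128 apart, so epoch-second timestamps are effectively quantized to steps of roughly two minutes, while a double still represents every such value exactly.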
