Jan,
Regarding the t-digest: we did a comparison study of the KLL sketch vs. the
t-digest
<https://datasketches.apache.org/docs/QuantilesStudies/KllSketchVsTDigest.html>.
There is also a more recent and more formal study comparing the ReqSketch
with the t-digest <https://arxiv.org/abs/2102.09299>.

Just be aware that the t-digest is an empirical algorithm that cannot
provide any *a priori* or *a posteriori* error guarantees because it is
input data-sensitive. The sketches from the DataSketches library are input
data-insensitive and have mathematically proven and predictable error
properties (i.e., with well-defined confidence intervals) no matter what
the input data looks like.
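
To make "a priori error guarantee" concrete, here is a minimal Python sketch.
It is not the KLL or REQ algorithm and it does not use the DataSketches API.
It assumes plain uniform random sampling with replacement plus the
Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, chosen only because that bound is
easy to state. The names dkw_epsilon and max_rank_error are made up for this
illustration. The point it tries to show is that the rank error bound is
computed from the sample size and confidence level alone, before any data is
seen.

```python
import bisect
import math
import random


def dkw_epsilon(n, delta):
    """Rank error bound from the DKW inequality (illustrative assumption).

    With probability at least 1 - delta, the normalized rank estimated from
    n i.i.d. samples is within this epsilon of the true normalized rank,
    for every query value at once. No property of the data enters the formula.
    """
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def max_rank_error(data, sample):
    """Largest |true rank - estimated rank| over all values in data."""
    data_sorted = sorted(data)
    sample_sorted = sorted(sample)
    worst = 0.0
    for x in data_sorted:
        true_rank = bisect.bisect_right(data_sorted, x) / len(data_sorted)
        est_rank = bisect.bisect_right(sample_sorted, x) / len(sample_sorted)
        worst = max(worst, abs(true_rank - est_rank))
    return worst


rng = random.Random(42)
n = 1000          # sample size (hypothetical choice)
delta = 0.01      # 99% confidence (hypothetical choice)
eps = dkw_epsilon(n, delta)  # known before looking at any data

datasets = {
    "uniform": [rng.random() for _ in range(20000)],
    "heavy-tailed": [rng.paretovariate(1.0) for _ in range(20000)],
}
for name, data in datasets.items():
    sample = rng.choices(data, k=n)  # uniform sampling with replacement
    print(name, "bound:", round(eps, 4),
          "observed:", round(max_rank_error(data, sample), 4))
```

With n = 1000 and delta = 0.01 the bound is about 0.051 in normalized rank,
and it is the same number for both data sets.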

Although the t-digest can perform very well on "well-behaved" data
distributions, it cannot give any warning when the results it produces are,
in fact, far off. So as a user, you can never really be sure how accurate its
results are.

Caveat Emptor!

Cheers,

Lee.

On Fri, Mar 19, 2021 at 10:40 AM Jan Prach (Jira) <[email protected]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/DATASKETCHES-10?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305075#comment-17305075
> ]
>
> Jan Prach commented on DATASKETCHES-10:
> ---------------------------------------
>
> [~veselyp] true! The power of open source :) I was thinking of it. But
> I've opened this ticket to see where DataSketches are heading. And in the
> meantime I've hooked up t-digest today.
>
> > Double precision by default?
> > ----------------------------
> >
> >                 Key: DATASKETCHES-10
> >                 URL: https://issues.apache.org/jira/browse/DATASKETCHES-10
> >             Project: Apache Datasketches
> >          Issue Type: Improvement
> >            Reporter: Jan Prach
> >            Priority: Major
> >
> > Would it make sense to use double (instead of float) for all sketches by
> > default?
> > It would take (less than 2x) more memory, have the same speed, and have
> > twice the storage. Or even the same storage if one is fine with float
> > precision. Most importantly, it would be far more useful.
> > I'm trying to build a generic profiler. In the first simple dataset there
> > were a couple of date and timestamp columns. The obvious choice is to
> > convert them to epoch seconds. I spent a full day on weird messages only to
> > realize that KllFloatsSketch, ReqSketch, etc. are all based on floats. That
> > means 24-bit precision. But epoch seconds today are 31-bit numbers.
> > Why not always double?
>
>
>
>
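
The precision concern in the quoted issue can be checked with a short Python
script. This is only a sketch using the standard struct module to round a
value to a 32-bit float and back; the helper name to_float32 and the sample
timestamp (an arbitrary value from March 2021) are assumptions for
illustration and are not part of any DataSketches API.

```python
import struct


def to_float32(x):
    """Round x to the nearest IEEE 754 32-bit float and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


# Hypothetical epoch-seconds timestamp (March 2021), chosen for illustration.
epoch_seconds = 1616146800

as_float = to_float32(epoch_seconds)   # what a float-based sketch would store
as_double = float(epoch_seconds)       # Python floats are 64-bit doubles

print("original:", epoch_seconds)
print("as float:", int(as_float), "error (s):", int(as_float) - epoch_seconds)
print("as double:", int(as_double), "error (s):", int(as_double) - epoch_seconds)

# Consecutive seconds become indistinguishable once stored as 32-bit floats.
print("t and t+1 collide:", to_float32(epoch_seconds) == to_float32(epoch_seconds + 1))
```

A 32-bit float has a 24-bit significand, so for values between 2^30 and 2^31
adjacent representable numbers are 128 apart. Timestamps in that range are
therefore rounded to a multiple of 128 seconds, while a 64-bit double
represents them exactly.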
