This paper does not solve the relative error problem we've discussed
previously. Which isn't to say it's not interesting, but it's not the same
problem (it's unfortunate that the paper picked the same name that we've
been using, when the problems are actually very different).
Here's the situation.
When we've said "relative-error quantiles", we've meant what the paper
calls "relative-error rank-error". That means, roughly, that if someone
specifies a query percentile q, the algorithm should return an item whose
true percentile is (1 \pm epsilon)*q for some small epsilon.
When the paper says "relative-error quantiles", what it means is that if
someone specifies a query percentile q, and the true q'th-percentile item
in the dataset is x_q, then the algorithm should return an item
\tilde{x}_q such that |x_q - \tilde{x}_q| <= epsilon*x_q.
So in their problem, error is measured with respect to the _magnitude of
the data items_.
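To make the contrast concrete, here is a toy illustration (mine, not from the paper) of the two error notions on a small sorted dataset, with eps = 0.1 and query percentile q = 0.9. The rank convention (item at rank ceil(q*n), 1-indexed) is one common choice, not necessarily the paper's.

```python
import math

data = [1, 2, 5, 9, 20, 50, 100, 200, 500, 1000]  # sorted
n, eps, q = len(data), 0.1, 0.9

def item_at_quantile(p):
    """The item at rank ceil(p * n), 1-indexed (one common convention)."""
    return data[max(0, math.ceil(p * n) - 1)]

x_q = item_at_quantile(q)  # true 90th-percentile item: 500

# Rank-error guarantee ("our" problem): the answer may be any item whose
# true percentile lies in [(1 - eps) * q, (1 + eps) * q] -- here, any item
# ranked between the 81st and 99th percentiles.
rank_band = [item_at_quantile((1 - eps) * q), item_at_quantile((1 + eps) * q)]

# Value-error guarantee (the paper's problem): the answer may be any value
# within eps * x_q of x_q, i.e. in [450, 550], regardless of its rank.
value_band = [(1 - eps) * x_q, (1 + eps) * x_q]
```

Note that the two bands can be wildly different: here the rank-error band stretches from 500 up to 1000 (the next item is far away in value), while the value-error band is a tight numeric interval around 500.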
This can certainly make sense for numerical data falling in some understood
range; e.g., they focus on latencies of web requests, which are apparently
typically between milliseconds and minutes.
But it is easy to come up with contrived examples where this notion of
error makes no sense. E.g., suppose all of the data items are numbers in
some narrow range, but the numbers themselves are huge (for concreteness,
say [10^{20}, 10^{20}+i] for some small value of i). Then their definition
permits additive error that is vastly bigger than the actual spread of the
data: the error can grow like 10^{20}, even though error at most i can be
achieved just by storing the max and min.
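A quick back-of-envelope check of this (mine; it assumes DDSketch-style log-spaced buckets with growth factor gamma = (1+alpha)/(1-alpha), which is the standard scheme for this kind of guarantee, not something I verified line-by-line against the paper):

```python
import math

alpha = 0.01                       # target relative accuracy
gamma = (1 + alpha) / (1 - alpha)  # bucket growth factor

def bucket(x):
    """Index of the log-spaced bucket containing x."""
    return math.ceil(math.log(x, gamma))

lo, hi = 1e20, 1e20 + 1e6          # all data in a window of width 10^6
assert bucket(lo) == bucket(hi)    # the whole dataset falls in one bucket

# That bucket's width near 10^20 is roughly 2 * alpha * 10^20 ~ 2 * 10^18,
# astronomically larger than the true spread of the data.
i = bucket(lo)
width = gamma ** i - gamma ** (i - 1)
```

So with this bucketing, every item in the window collapses into a single bucket whose width dwarfs the window itself, which is exactly the failure mode of the contrived example.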
In our problem, the actual data items are never referred to, only the
ranks---in fact, our problem makes sense even if data is not numerical, so
long as it is totally ordered.
Their algorithm is very simple, and it does have provable worst-case error
guarantees if one lets the space cost grow logarithmically with the
universe size. In the paper, they seem to want the space cost to be
constant independent of the universe size, and they develop some pruning
heuristics to deal with this, but these heuristics don't (and can't) come
with worst-case guarantees. All of their experiments and analyses in these
settings are for highly "well-behaved" data (e.g., i.i.d. samples from very
well-understood distributions).
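For a rough sense of the logarithmic space cost (my back-of-envelope, not a figure from the paper): covering values in [x_min, x_max] with log-spaced buckets of growth factor gamma takes about log_gamma(x_max / x_min) buckets, i.e. space logarithmic in the dynamic range.

```python
import math

alpha = 0.01                       # target relative accuracy
gamma = (1 + alpha) / (1 - alpha)  # bucket growth factor

def buckets_needed(x_min, x_max):
    """Approximate number of log-spaced buckets to span [x_min, x_max]."""
    return math.ceil(math.log(x_max / x_min, gamma))

# E.g. web-request latencies from 1 ms to 1 minute:
print(buckets_needed(1e-3, 60.0))  # a few hundred buckets at alpha = 1%
```

For a bounded range like the latency example, that is a perfectly modest constant; it only blows up if the universe (the ratio x_max/x_min) is unbounded.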
They seem to be the first work (other than HDRHistogram, which I haven't
looked at but which apparently doesn't come with provable error guarantees)
to study this notion of error for quantiles. I am guessing this is at least
in part because solving the problem they define is not super difficult from
an algorithmic perspective.
But, to be clear, if their notion of error actually makes sense for an
application, and one is happy letting the space grow logarithmically with
the universe size, then simplicity is a virtue, not a flaw.
BTW I knew one of the authors when I was starting grad school. We've lost
touch, but I'm happy to see a new paper by him!
On Sun, Oct 20, 2019 at 2:42 PM Jon Malkin <[email protected]> wrote:
> A cursory glance suggests that it's O(N) in size if you want guarantees
> against an adversary. The bucket collapsing method reduces accuracy in
> collapsed buckets. If so, the attack would be easiest when distributed to
> manipulate data splits so that you collapse significant blocks of data
> together when merging.
>
> I'll see if I have a chance to read things more carefully this coming week.
>
> jon
>
> On Sat, Oct 19, 2019, 10:32 AM leerho <[email protected]> wrote:
>
> > Science team,
> >
> > Would you please take a look at this sketch. It appears to address the
> > "relative error" for quantiles issue we have been discussing recently.
> >
> > https://blog.acolyer.org/2019/09/06/ddsketch/
> >
> > Thanks,
> >
> > Lee.
> >
>