legoscia opened a new pull request, #15381:
URL: https://github.com/apache/druid/pull/15381
There is a problem with Quantiles sketches and KLL Quantiles sketches.
Queries using the histogram post-aggregator fail if:
- the sketch contains at least one value, and
- the values in the sketch are all equal, and
- the splitPoints argument is not passed to the post-aggregator, and
- the numBins argument is greater than 2 (or not specified, which leads to
the default of 10 being used)
In that case, the query fails and returns this error:
{
  "error": "Unknown exception",
  "errorClass": "org.apache.datasketches.common.SketchesArgumentException",
  "host": null,
  "errorCode": "legacyQueryException",
  "persona": "OPERATOR",
  "category": "RUNTIME_FAILURE",
  "errorMessage": "Values must be unique, monotonically increasing and not NaN.",
  "context": {
    "host": null,
    "errorClass": "org.apache.datasketches.common.SketchesArgumentException",
    "legacyErrorCode": "Unknown exception"
  }
}
This behaviour is undesirable, since the caller doesn't necessarily know in
advance whether the sketch has values that are diverse enough. With this
change, the post-aggregators return [N, 0, 0...] instead of crashing, where N
is the number of values in the sketch, and the length of the list is equal to
numBins. That is what they already returned for numBins = 2.
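The conditions above can be illustrated with a minimal sketch. It assumes that, when `splitPoints` is omitted, the post-aggregator derives `numBins - 1` evenly spaced split points between the sketch's minimum and maximum; the helper names `evenSplitPoints` and `isStrictlyIncreasing` are hypothetical and are not Druid or DataSketches API. Under that assumption, equal minimum and maximum values yield repeated split points as soon as there are two or more of them, which is the input the error message rejects.

```java
import java.util.Arrays;

final class DegenerateSplitPoints {
    // Hypothetical helper (an assumption, not Druid code): numBins - 1 evenly
    // spaced split points between min and max.
    static double[] evenSplitPoints(double min, double max, int numBins) {
        double delta = (max - min) / numBins;
        double[] splitPoints = new double[numBins - 1];
        for (int i = 0; i < splitPoints.length; i++) {
            splitPoints[i] = min + delta * (i + 1);
        }
        return splitPoints;
    }

    // Mirrors the constraint in the error message: no NaN, and every value
    // strictly greater than its predecessor.
    static boolean isStrictlyIncreasing(double[] values) {
        for (int i = 0; i < values.length; i++) {
            if (Double.isNaN(values[i])) {
                return false;
            }
            if (i > 0 && !(values[i] > values[i - 1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // All values equal to 42.0 and numBins = 3: both split points are 42.0.
        double[] threeBins = evenSplitPoints(42.0, 42.0, 3);
        System.out.println(Arrays.toString(threeBins) + " valid: " + isStrictlyIncreasing(threeBins));

        // numBins = 2: a single split point cannot contain a duplicate.
        double[] twoBins = evenSplitPoints(42.0, 42.0, 2);
        System.out.println(Arrays.toString(twoBins) + " valid: " + isStrictlyIncreasing(twoBins));
    }
}
```

This would also be consistent with numBins = 2 not triggering the error, since a single split point trivially satisfies the uniqueness requirement.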
Here is an example of a query that would fail:
{
  "queryType": "timeseries",
  "dataSource": {
    "type": "inline",
    "columnNames": ["foo", "bar"],
    "rows": [
      ["abc", 42.0],
      ["def", 42.0]
    ]
  },
  "intervals": ["0000/3000"],
  "granularity": "all",
  "aggregations": [
    {"name": "the_sketch", "fieldName": "bar", "type": "quantilesDoublesSketch"}
  ],
  "postAggregations": [
    {
      "name": "the_histogram",
      "type": "quantilesDoublesSketchToHistogram",
      "field": {"type": "fieldAccess", "fieldName": "the_sketch"},
      "numBins": 3
    }
  ]
}
I believe this also fixes issue #10585.
### Description
I noticed this error when trying to get histograms from quantiles sketches.
At first it seemed intermittent and random, as it would go away when I changed
the query a bit, but eventually I realised that it depends on the underlying
data. topN queries are particularly susceptible, since a single dimension
value whose sketch contains only one distinct value is enough to make the
entire query fail.
I'm checking for the case where `splitPoints` isn't explicitly specified,
but the minimum and maximum values of the sketch are equal. In that case, I
don't call the `getPMF` method of the sketch, since the result is already
known. Instead, I just return an array of length `numBins` where the first
element is the number of values in the sketch and the remaining elements are
zero.
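A minimal sketch of that check, under stated assumptions: the class and method names below are hypothetical, the sketch's minimum, maximum, and value count are passed in as plain arguments rather than read from a real sketch object, and the `double[]` return type is a guess. It is not the code in this PR, only an illustration of the short-circuit described above.

```java
final class EqualValuesHistogram {
    // Hypothetical guard: true when no split points were supplied and the
    // sketch holds only one distinct value. If min and max are NaN, the
    // comparison is false, so the guard does not apply.
    static boolean allValuesEqual(double[] splitPoints, double min, double max) {
        return splitPoints == null && min == max;
    }

    // Hypothetical result builder: [n, 0, 0, ...] with numBins entries.
    static double[] histogramForEqualValues(long n, int numBins) {
        double[] histogram = new double[numBins]; // Java zero-initialises the array
        histogram[0] = n;
        return histogram;
    }

    public static void main(String[] args) {
        // Two rows, both 42.0, numBins = 3, no explicit split points.
        if (allValuesEqual(null, 42.0, 42.0)) {
            double[] histogram = histogramForEqualValues(2, 3);
            System.out.println(java.util.Arrays.toString(histogram));
        }
    }
}
```

Comparing the minimum and maximum directly avoids having to construct any split points for the degenerate case.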
I considered changing the list of split points to something that `getPMF`
would accept, e.g. setting `delta` to 1.0, or setting `max` to
`Double.MAX_VALUE` and calculating `delta` from that. In the end, I thought
that there is no obvious choice, and any way of coming up with artificial split
points could cause problems depending on which values are in the sketch. (For
example, if the minimum value is 2^53 or greater, adding 1.0 can become a no-op.)
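The floating-point concern can be checked directly. The following is a small illustration using only standard Java; the class and method names are made up for this example. A `double` has a 53-bit significand, so at 2^53 the spacing between adjacent representable values is already 2.0.

```java
final class LargeDoubleIncrement {
    // True when value + 1.0 rounds back to value itself.
    static boolean addingOneIsNoOp(double value) {
        return value + 1.0 == value;
    }

    public static void main(String[] args) {
        double big = (double) (1L << 53); // 2^53, exactly representable

        System.out.println(addingOneIsNoOp(42.0)); // false
        System.out.println(addingOneIsNoOp(big));  // true
        System.out.println(Math.ulp(big));         // 2.0
    }
}
```

An artificial upper split point of `min + 1.0` would therefore equal `min` for such values, and the split points would again fail the uniqueness check.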
#### Release note
Fixed: Histogram post-aggregators for Quantiles and KLL sketches no longer
fail if all values in the sketch are equal.
<hr>
##### Key changed/added classes in this PR
* `DoublesSketchToHistogramPostAggregator`
* `KllDoublesSketchToHistogramPostAggregator`
* `KllFloatsSketchToHistogramPostAggregator`
<hr>
This PR has:
- [x] been self-reviewed.
- [x] a release note entry in the PR description.
- [x] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [x] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]