scheler opened a new issue, #17822:
URL: https://github.com/apache/druid/issues/17822
Faulty/incorrect values noticed in FixedBucketsHistogram column after rollup.
### Affected Version
30.0.1
### Description
We are seeing some rows where the bucket counts array in the
FixedBucketsHistogram column has incorrect values, which leads to incorrect
percentile computations. The data is ingested via Kafka, and we verified that
the faulty records do not come from the source. Rollup is configured, and we
have narrowed the problem down to the rollup step producing the faulty records.
However, it is not clear how to troubleshoot this further.
For example, the bucket values in the ingested records look like this:
```
1741883320000: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]
1741883320000: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]
1741883330000: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]
1741883330000: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]
1741883330000: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]
```
and after rollup, we see
1741883330000: [3150, 15, 6, 64, 91, 346, 1602, 1752, 2063, 971, 594, 221,
145, 25, 97, 13, 32, 35, 16, 12, 1, 3, 45, 17, 0]
Note that the entries above were obtained by decoding the base64 data
from the column values:
```
// Decode the base64 column value and print its per-bucket counts.
FixedBucketsHistogram histogram = FixedBucketsHistogram.fromBase64(base64);
System.out.println(Arrays.toString(histogram.getHistogram()));
```
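To make the mismatch concrete, here is a minimal sketch of the result we would expect if rollup simply summed the bucket counts element-wise. It assumes the three 1741883330000 records listed above are the only inputs to that rolled-up row, which may not hold if more records arrived in the same 10s window. This is plain Java, not Druid's merge code, and the class and method names (`ExpectedRollup`, `oneHot`, `merge`, `total`) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain Java, not Druid's histogram merge implementation.
final class ExpectedRollup {
    static final int NUM_BUCKETS = 25;

    // Builds a histogram holding a single count of 1 in the given bucket.
    static long[] oneHot(int bucket) {
        long[] counts = new long[NUM_BUCKETS];
        counts[bucket] = 1;
        return counts;
    }

    // Element-wise sum of bucket counts (assumed rollup behaviour).
    static long[] merge(List<long[]> histograms) {
        long[] merged = new long[NUM_BUCKETS];
        for (long[] h : histograms) {
            for (int i = 0; i < NUM_BUCKETS; i++) {
                merged[i] += h[i];
            }
        }
        return merged;
    }

    // Total number of values recorded across all buckets.
    static long total(long[] counts) {
        return Arrays.stream(counts).sum();
    }

    public static void main(String[] args) {
        // The three ingested records at 1741883330000 have their single
        // count in buckets 10, 8 and 9 respectively.
        long[] expected = merge(List.of(oneHot(10), oneHot(8), oneHot(9)));

        // Bucket counts actually observed after rollup.
        long[] observed = {3150, 15, 6, 64, 91, 346, 1602, 1752, 2063, 971,
                594, 221, 145, 25, 97, 13, 32, 35, 16, 12, 1, 3, 45, 17, 0};

        System.out.println("expected: " + Arrays.toString(expected));
        System.out.println("expected total: " + total(expected));
        System.out.println("observed total: " + total(observed));
    }
}
```

Under that assumption the expected row holds 3 values in total, while the observed row holds far more, so comparing the total bucket count against the number of ingested rows for the same timestamp may be one way to flag the faulty rows.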
The rollup is configured for every 10s:
```
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "HOUR",
  "queryGranularity": {
    "type": "duration",
    "duration": 10000,
    "origin": "1970-01-01T00:00:00.000Z"
  },
  "rollup": true,
  "intervals": []
},
```
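As a sketch of what this spec implies for the timestamps above, the snippet below truncates a timestamp to the start of its 10s bucket, assuming a duration granularity floors to multiples of `duration` counted from `origin`. It is only the arithmetic, not Druid's implementation, and `QueryGranularityBucket` and `bucketStart` are hypothetical names.

```java
// Illustrative arithmetic only, not Druid's granularity code.
final class QueryGranularityBucket {

    // Floors a timestamp to the start of its bucket, where buckets are
    // durationMillis wide and aligned to originMillis.
    static long bucketStart(long timestampMillis, long durationMillis, long originMillis) {
        return timestampMillis - Math.floorMod(timestampMillis - originMillis, durationMillis);
    }

    public static void main(String[] args) {
        long duration = 10_000L; // "duration": 10000
        long origin = 0L;        // "origin": "1970-01-01T00:00:00.000Z"

        System.out.println(bucketStart(1741883320000L, duration, origin)); // prints 1741883320000
        System.out.println(bucketStart(1741883330000L, duration, origin)); // prints 1741883330000
        System.out.println(bucketStart(1741883334567L, duration, origin)); // prints 1741883330000
    }
}
```

With this arithmetic the two timestamps in the sample land in separate buckets, so only records in the 1741883330000 to 1741883339999 range would be expected to contribute to the rolled-up row shown earlier.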
These faulty records are few, about 5-10 per day, but when they are
included in a topN aggregation query they skew the results.
We have been unable to exclude them in the query, so I would appreciate any
ideas on that too.
Any ideas on what could be causing this or how to troubleshoot this further?
Please let me know if any additional information would be helpful. Thanks!