Re: [PR] [fix](be) Clean linear histogram decimal boundaries [doris]

via GitHub Tue, 19 May 2026 22:01:15 -0700


github-actions[bot] commented on code in PR #63430:
URL: https://github.com/apache/doris/pull/63430#discussion_r3271315832



##########
be/src/exprs/aggregate/aggregate_function_linear_histogram.h:
##########
@@ -174,6 +180,50 @@ struct AggregateFunctionLinearHistogramData {
         auto& column = assert_cast<ColumnString&>(to);
         column.insert_data(buffer.GetString(), buffer.GetSize());
     }
+
+private:
+    double bucket_key(double value) const {
+        double key = (value - offset) / interval;

Review Comment:
   This tolerance changes bucket assignment for valid values that are close to, 
but still below, a boundary. For example, with `interval = 1`, `offset = 0`, 
and `value = 999999999.999999`, the mathematical bucket is 
`floor(999999999.999999) = 999999999`, but the tolerance is about `1.8e-6`, so 
this branch rounds it to `1000000000`. Near the configured upper key limit it 
can also turn a valid last-bucket value such as `2147483646.999999` into 
`2147483647` and throw `exceeds the bucket range limit`. The fix should avoid 
using an absolute key-space tolerance that can move real input values across 
bucket boundaries; restrict correction to the specific floating artifact being 
addressed or keep exact `floor` semantics for bucket selection.



##########
be/src/exprs/aggregate/aggregate_function_linear_histogram.h:
##########
@@ -174,6 +180,50 @@ struct AggregateFunctionLinearHistogramData {
         auto& column = assert_cast<ColumnString&>(to);
         column.insert_data(buffer.GetString(), buffer.GetSize());
     }
+
+private:
+    double bucket_key(double value) const {
+        double key = (value - offset) / interval;
+        double rounded_key = std::round(key);
+        double tolerance =
+                std::numeric_limits<double>::epsilon() * 
std::max(std::abs(key), 1.0) * 8;
+        if (std::abs(key - rounded_key) <= tolerance) {
+            return rounded_key;
+        }
+        return std::floor(key);
+    }
+
+    int32_t infer_boundary_scale() const {
+        int32_t interval_scale = decimal_scale(interval);
+        int32_t offset_scale = decimal_scale(offset);
+        if (interval_scale == NO_BOUNDARY_SCALE || offset_scale == 
NO_BOUNDARY_SCALE) {
+            return NO_BOUNDARY_SCALE;
+        }
+        return std::max(interval_scale, offset_scale);
+    }
+
+    int32_t decimal_scale(double value) const {
+        double multiplier = 1;
+        for (int32_t scale = 0; scale <= MAX_BOUNDARY_SCALE; ++scale, 
multiplier *= 10) {
+            double scaled_value = value * multiplier;

Review Comment:
   `decimal_scale()` treats very small nonzero values as scale 0 because the 
scale-0 tolerance is around `1.8e-15`. For `interval = 1e-16` and `offset = 0`, 
this returns boundary scale 0, so `bucket_boundary()` rounds boundaries like 
`1e-16` and `2e-16` to `0`, serializing buckets with identical/wrong lower and 
upper values. That regresses valid positive intervals compared with the 
previous direct double boundary output. Please make scale inference reject 
values that round to zero unless the original value is actually zero, or 
otherwise avoid applying decimal rounding when it would erase nonzero 
boundaries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [fix](be) Clean linear histogram decimal boundaries [doris]

Reply via email to