alamb commented on code in PR #6982:
URL: https://github.com/apache/arrow-datafusion/pull/6982#discussion_r1268660307
##########
datafusion/common/src/scalar.rs:
##########
@@ -3496,6 +3496,22 @@ impl ScalarValue {
.map(|sv| sv.size() - std::mem::size_of_val(sv))
.sum::<usize>()
}
+
+ /// This function returns the next/previous value depending on the `DIR`
value.
+ /// If `true`, it returns the next value; otherwise it returns the
previous value.
+ pub fn next_value<const DIR: bool>(self) -> ScalarValue {
Review Comment:
I think I gave the opposite feedback in my initial review . I am sorry :(
##########
datafusion/physical-expr/src/intervals/interval_aritmetic.rs:
##########
@@ -1172,4 +1321,70 @@ mod tests {
let upper = 1.5;
capture_mode_change_f32((lower, upper), true, true);
}
+
+ #[test]
+ fn test_cardinality_of_intervals() -> Result<()> {
+ // In IEEE 754 standard for floating-point arithmetic, if we keep the
sign and exponent fields same,
+ // we can represent 4503599627370496 different numbers by changing the
mantissa
+ // (4503599627370496 = 2^52, since there are 52 bits in mantissa, and
2^23 = 8388608 for f32).
+ let distinct_f64 = 4503599627370496;
+ let distinct_f32 = 8388608;
+ let intervals = [
+ Interval::new(
+ IntervalBound::new(ScalarValue::from(0.25), false),
+ IntervalBound::new(ScalarValue::from(0.50), true),
+ ),
+ Interval::new(
+ IntervalBound::new(ScalarValue::from(0.5), false),
+ IntervalBound::new(ScalarValue::from(1.0), true),
+ ),
+ Interval::new(
+ IntervalBound::new(ScalarValue::from(1.0), false),
+ IntervalBound::new(ScalarValue::from(2.0), true),
+ ),
+ Interval::new(
+ IntervalBound::new(ScalarValue::from(32.0), false),
+ IntervalBound::new(ScalarValue::from(64.0), true),
+ ),
+ Interval::new(
+ IntervalBound::new(ScalarValue::from(-0.50), false),
+ IntervalBound::new(ScalarValue::from(-0.25), true),
+ ),
+ Interval::new(
+ IntervalBound::new(ScalarValue::from(-32.0), false),
+ IntervalBound::new(ScalarValue::from(-16.0), true),
+ ),
+ ];
+ for interval in intervals {
+ assert_eq!(interval.cardinality()?, distinct_f64);
+ }
+
+ let intervals = [
+ Interval::new(
+ IntervalBound::new(ScalarValue::from(0.25_f32), false),
+ IntervalBound::new(ScalarValue::from(0.50_f32), true),
+ ),
+ Interval::new(
+ IntervalBound::new(ScalarValue::from(-1_f32), false),
+ IntervalBound::new(ScalarValue::from(-0.5_f32), true),
+ ),
+ ];
+ for interval in intervals {
+ assert_eq!(interval.cardinality()?, distinct_f32);
+ }
+
+ let interval = Interval::new(
+ IntervalBound::new(ScalarValue::from(-0.0625), false),
+ IntervalBound::new(ScalarValue::from(0.0625), true),
+ );
+ assert_eq!(interval.cardinality()?, distinct_f64 * 2_048);
Review Comment:
Yes I agree -- all selectivity estimates are going to have errors introduced
by assumptions made about the distribution (which is not fully known). Assuming
a uniform distribution is an easy to understand choice, but of course causes
substantial estimation error with skewed distributions
It was common for sophisticated cost models in other systems to have
histogram information, but I think the current thinking is that it is better to
be more adaptive / tolerant of bad cardinality estimations than to try and
improve the cost models more.
##########
datafusion/physical-expr/src/intervals/interval_aritmetic.rs:
##########
@@ -451,6 +453,75 @@ impl Interval {
lower: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
upper: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
};
+
+ // Cardinality is the number of all points included by the interval,
considering its bounds.
Review Comment:
Something that might be worth considering in the long term that @tustvold
mentioned the other day is to vectorize these calculations -- at the moment
they are doin in the context of a single expression, but eventually if we want
to use this logic to prune large numbers of files / etc based on statistics it
may take too long
No change is needed here, I am just planting a seed of an idea in case this
was on your list too
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]