alamb commented on code in PR #6982:
URL: https://github.com/apache/arrow-datafusion/pull/6982#discussion_r1268660307


##########
datafusion/common/src/scalar.rs:
##########
@@ -3496,6 +3496,22 @@ impl ScalarValue {
                 .map(|sv| sv.size() - std::mem::size_of_val(sv))
                 .sum::<usize>()
     }
+
+    /// This function returns the next/previous value depending on the `DIR` value.
+    /// If `true`, it returns the next value; otherwise it returns the previous value.
+    pub fn next_value<const DIR: bool>(self) -> ScalarValue {
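
For context, the next/previous representable float can be obtained by stepping the raw IEEE 754 bit pattern, since for finite positive floats the bit pattern is monotonically ordered. A minimal sketch of the idea (hypothetical `next_float` helper; not the PR's actual implementation, which must also handle signs, zeros, NaNs, and infinities):

```rust
/// Step to the adjacent representable f64. Simplified sketch: only
/// handles finite, positive, non-zero inputs.
fn next_float<const DIR: bool>(x: f64) -> f64 {
    let bits = x.to_bits();
    // For positive finite floats, incrementing/decrementing the raw bit
    // pattern moves exactly one ULP up/down.
    f64::from_bits(if DIR { bits + 1 } else { bits - 1 })
}

fn main() {
    let x = 1.0_f64;
    let next = next_float::<true>(x);
    let prev = next_float::<false>(x);
    assert!(prev < x && x < next);
    // The step is exactly one unit in the last place (ULP).
    assert_eq!(next.to_bits(), x.to_bits() + 1);
}
```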

Review Comment:
   I think I gave the opposite feedback in my initial review. I am sorry :(



##########
datafusion/physical-expr/src/intervals/interval_aritmetic.rs:
##########
@@ -1172,4 +1321,70 @@ mod tests {
         let upper = 1.5;
         capture_mode_change_f32((lower, upper), true, true);
     }
+
+    #[test]
+    fn test_cardinality_of_intervals() -> Result<()> {
+        // In the IEEE 754 standard for floating-point arithmetic, if we keep the
+        // sign and exponent fields the same, we can represent 4503599627370496
+        // different numbers by changing the mantissa (4503599627370496 = 2^52,
+        // since the f64 mantissa has 52 bits; the f32 mantissa has 23 bits,
+        // giving 2^23 = 8388608).
+        let distinct_f64 = 4503599627370496;
+        let distinct_f32 = 8388608;
+        let intervals = [
+            Interval::new(
+                IntervalBound::new(ScalarValue::from(0.25), false),
+                IntervalBound::new(ScalarValue::from(0.50), true),
+            ),
+            Interval::new(
+                IntervalBound::new(ScalarValue::from(0.5), false),
+                IntervalBound::new(ScalarValue::from(1.0), true),
+            ),
+            Interval::new(
+                IntervalBound::new(ScalarValue::from(1.0), false),
+                IntervalBound::new(ScalarValue::from(2.0), true),
+            ),
+            Interval::new(
+                IntervalBound::new(ScalarValue::from(32.0), false),
+                IntervalBound::new(ScalarValue::from(64.0), true),
+            ),
+            Interval::new(
+                IntervalBound::new(ScalarValue::from(-0.50), false),
+                IntervalBound::new(ScalarValue::from(-0.25), true),
+            ),
+            Interval::new(
+                IntervalBound::new(ScalarValue::from(-32.0), false),
+                IntervalBound::new(ScalarValue::from(-16.0), true),
+            ),
+        ];
+        for interval in intervals {
+            assert_eq!(interval.cardinality()?, distinct_f64);
+        }
+
+        let intervals = [
+            Interval::new(
+                IntervalBound::new(ScalarValue::from(0.25_f32), false),
+                IntervalBound::new(ScalarValue::from(0.50_f32), true),
+            ),
+            Interval::new(
+                IntervalBound::new(ScalarValue::from(-1_f32), false),
+                IntervalBound::new(ScalarValue::from(-0.5_f32), true),
+            ),
+        ];
+        for interval in intervals {
+            assert_eq!(interval.cardinality()?, distinct_f32);
+        }
+
+        let interval = Interval::new(
+            IntervalBound::new(ScalarValue::from(-0.0625), false),
+            IntervalBound::new(ScalarValue::from(0.0625), true),
+        );
+        assert_eq!(interval.cardinality()?, distinct_f64 * 2_048);
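
The expected counts in the test above can be double-checked directly from the bit representations: for positive finite floats, `to_bits` is monotone, so the number of representable values in `[a, b)` is simply `b.to_bits() - a.to_bits()`. A small standalone check (not part of the PR):

```rust
fn main() {
    // One full "mantissa cycle" between consecutive powers of two:
    // [0.25, 0.50) contains 2^52 representable f64 values...
    let distinct_f64 = 0.50_f64.to_bits() - 0.25_f64.to_bits();
    assert_eq!(distinct_f64, 4503599627370496); // 2^52

    // ...and 2^23 representable f32 values.
    let distinct_f32 = 0.50_f32.to_bits() - 0.25_f32.to_bits();
    assert_eq!(distinct_f32, 8388608); // 2^23
}
```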

Review Comment:
   Yes, I agree -- all selectivity estimates are going to have errors introduced by assumptions made about the distribution (which is not fully known). Assuming a uniform distribution is an easy-to-understand choice, but of course it causes substantial estimation error with skewed distributions.
   
   It was common for sophisticated cost models in other systems to include histogram information, but I think the current thinking is that it is better to be adaptive / tolerant of bad cardinality estimates than to try to improve the cost models further.
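
To illustrate how badly the uniform assumption can miss on skewed data, here is a hypothetical standalone example (made-up data set, not DataFusion code): a reciprocal distribution where almost all values cluster near zero.

```rust
fn main() {
    // Hypothetical skewed data set: 1/1, 1/2, ..., 1/1000.
    let data: Vec<f64> = (1..=1000).map(|i| 1.0 / i as f64).collect();
    let (min, max) = (1.0 / 1000.0, 1.0);

    // Uniform-distribution selectivity estimate for the range [0.5, 1.0]:
    // range width divided by the total domain width, roughly 50%.
    let estimated = (1.0 - 0.5) / (max - min);

    // Actual selectivity: only 1.0 (i = 1) and 0.5 (i = 2) fall in the range.
    let matching = data.iter().filter(|&&v| (0.5..=1.0).contains(&v)).count();
    let actual = matching as f64 / data.len() as f64;

    assert!(estimated > 0.5); // uniform assumption: ~50% selectivity
    assert_eq!(actual, 0.002); // reality: 0.2% selectivity
}
```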



##########
datafusion/physical-expr/src/intervals/interval_aritmetic.rs:
##########
@@ -451,6 +453,75 @@ impl Interval {
         lower: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
         upper: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
     };
+
+    // Cardinality is the number of all points included by the interval, considering its bounds.

Review Comment:
   Something that might be worth considering in the long term, which @tustvold mentioned the other day, is vectorizing these calculations -- at the moment they are done in the context of a single expression, but if we eventually want to use this logic to prune large numbers of files, etc. based on statistics, it may take too long.
   
   No change is needed here; I am just planting a seed of an idea in case this was on your list too.


