alamb commented on code in PR #9007:
URL: https://github.com/apache/arrow-datafusion/pull/9007#discussion_r1468143094


##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -955,7 +955,12 @@ fn max_distinct_count(
             let result = match num_rows {
                 Precision::Absent => Precision::Absent,
                 Precision::Inexact(count) => {
-                    Precision::Inexact(count - 
stats.null_count.get_value().unwrap_or(&0))
+                    // To safeguard against inexact number of rows (e.g. 0) 
being smaller than
+                    // an exact null count we need to do a checked subtraction.
+                    match 
count.checked_sub(*stats.null_count.get_value().unwrap_or(&0)) {
+                        None => Precision::Inexact(0),

Review Comment:
   I agree -- the use of `Statistics::get_value()` I think may also have other 
bugs as `get_value()` may be exact or inexact but there are some places in the 
code that treat it as though it were always `exact` (like here)
   
   I have hopes to improve statistics in general (see 
https://github.com/apache/arrow-datafusion/issues/8227) but other higher 
priority things have kept me busy. I think @berkaysynnada  was also working on 
this item for a while -- I am not sure if they have any short term plans



##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -1670,133 +1677,156 @@ mod tests {
             //

Review Comment:
   I double checked that this code fails without the code change in this PR 👍 
   
   ```shell
   attempt to subtract with overflow
   thread 'joins::utils::tests::test_inner_join_cardinality_single_column' 
panicked at 
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/arith.rs:217:1:
   attempt to subtract with overflow
   stack backtrace:
      0: rust_begin_unwind
                at 
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
      1: core::panicking::panic_fmt
                at 
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
      2: core::panicking::panic
                at 
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:127:5
      3: <usize as core::ops::arith::Sub>::sub
                at 
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/arith.rs:217:1
      4: <&usize as core::ops::arith::Sub<&usize>>::sub
                at 
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/internal_macros.rs:55:17
      5: datafusion_physical_plan::joins::utils::max_distinct_count
                at ./src/joins/utils.rs:960:40
      6: datafusion_physical_plan::joins::utils::estimate_inner_join_cardinality
                at ./src/joins/utils.rs:911:33
      7: 
datafusion_physical_plan::joins::utils::tests::test_inner_join_cardinality_single_column
                at ./src/joins/utils.rs:1818:17
      8: 
datafusion_physical_plan::joins::utils::tests::test_inner_join_cardinality_single_column::{{closure}}
                at ./src/joins/utils.rs:1666:55
      9: core::ops::function::FnOnce::call_once
                at 
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
     10: core::ops::function::FnOnce::call_once
                at 
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
   note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose 
backtrace.
   
   error: test failed, to rerun pass `--lib`
   error: 1 target failed:
       `--lib`
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to