alamb commented on code in PR #9007:
URL: https://github.com/apache/arrow-datafusion/pull/9007#discussion_r1468143094
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -955,7 +955,12 @@ fn max_distinct_count(
let result = match num_rows {
Precision::Absent => Precision::Absent,
Precision::Inexact(count) => {
- Precision::Inexact(count -
stats.null_count.get_value().unwrap_or(&0))
+ // To safeguard against inexact number of rows (e.g. 0)
being smaller than
+ // an exact null count we need to do a checked subtraction.
+ match
count.checked_sub(*stats.null_count.get_value().unwrap_or(&0)) {
+ None => Precision::Inexact(0),
Review Comment:
I agree -- the use of `Statistics::get_value()` I think may also have other
bugs as `get_value()` may be exact or inexact but there are some places in the
code that treat it as though it were always `exact` (like here)
I have hopes to improve statistics in general (see
https://github.com/apache/arrow-datafusion/issues/8227) but other higher
priority things have kept me busy. I think @berkaysynnada was also working on
this item for a while -- I am not sure if they have any short term plans
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -1670,133 +1677,156 @@ mod tests {
//
Review Comment:
I double checked that this code fails without the code change in this PR 👍
```shell
attempt to subtract with overflow
thread 'joins::utils::tests::test_inner_join_cardinality_single_column'
panicked at
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/arith.rs:217:1:
attempt to subtract with overflow
stack backtrace:
0: rust_begin_unwind
at
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
1: core::panicking::panic_fmt
at
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
2: core::panicking::panic
at
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:127:5
3: <usize as core::ops::arith::Sub>::sub
at
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/arith.rs:217:1
4: <&usize as core::ops::arith::Sub<&usize>>::sub
at
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/internal_macros.rs:55:17
5: datafusion_physical_plan::joins::utils::max_distinct_count
at ./src/joins/utils.rs:960:40
6: datafusion_physical_plan::joins::utils::estimate_inner_join_cardinality
at ./src/joins/utils.rs:911:33
7:
datafusion_physical_plan::joins::utils::tests::test_inner_join_cardinality_single_column
at ./src/joins/utils.rs:1818:17
8:
datafusion_physical_plan::joins::utils::tests::test_inner_join_cardinality_single_column::{{closure}}
at ./src/joins/utils.rs:1666:55
9: core::ops::function::FnOnce::call_once
at
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
10: core::ops::function::FnOnce::call_once
at
/rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose
backtrace.
error: test failed, to rerun pass `--lib`
error: 1 target failed:
`--lib`
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]