zhzy0077 opened a new issue, #5810:
URL: https://github.com/apache/arrow-datafusion/issues/5810

   ### Describe the bug
   
   When computing statistics, selectivity is calculated from the distance of the selected range relative to the distance of the total range. See `analyze_expr_scalar_comparison`.
   
   To compute the distance, `ScalarValue::distance` (in `scalar.rs`) simply subtracts MIN from MAX. When that subtraction overflows, it either panics or returns a negative number.
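   
   As a rough illustration of an overflow-safe alternative, here is a minimal sketch for the `i64` case only. The helpers `distance_i64` and `selectivity_lt` are hypothetical names made up for this example, not DataFusion APIs, and the actual fix may look quite different. The idea is to return the distance as a `u64`, which can hold the width of any `i64` range.

   ```rust
   /// Hypothetical helper (not a DataFusion API): width of the range [min, max].
   /// Returns None when the bounds are inverted.
   fn distance_i64(min: i64, max: i64) -> Option<u64> {
       if min > max {
           return None;
       }
       // abs_diff returns a u64, so even i64::MAX - i64::MIN fits without overflow.
       Some(max.abs_diff(min))
   }

   /// Hypothetical selectivity estimate for `value < upper`, given column
   /// statistics [min, max]. Returns None when no estimate can be made.
   fn selectivity_lt(min: i64, max: i64, upper: i64) -> Option<f64> {
       let total = distance_i64(min, max)?;
       if total == 0 {
           // A single-valued column has no range to take a ratio of.
           return None;
       }
       // Clamp the literal into the known range before measuring.
       let clamped = upper.clamp(min, max);
       let selected = distance_i64(min, clamped)?;
       Some(selected as f64 / total as f64)
   }
   ```

   With the statistics from this report, `selectivity_lt(i64::MIN, i64::MAX, 0)` evaluates to roughly 0.5 instead of panicking.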
   
   ### To Reproduce
   
   Create a parquet file with max value i64::MAX and min value i64::MIN.
   Run a query like:
   ```rust
       let ctx = SessionContext::new();
       let df = ctx.read_parquet("<file>.parquet", ParquetReadOptions::default()).await?;
       let df = df
           .filter(col("value").lt(lit(0_i64)))?
           .aggregate(vec![], vec![max(col("value"))])?;
       df.show().await?;
   ```
   
   ### Expected behavior
   
   The query completes without panicking and shows the result.
   
   ### Additional context
   
   It panics when run in debug mode:
   ```
   thread 'main' panicked at 'attempt to subtract with overflow', ~/repo/arrow-datafusion/datafusion/common/src/scalar.rs:1816:9
   stack backtrace:
      0: rust_begin_unwind
                at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:575:5
      1: core::panicking::panic_fmt
                at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/panicking.rs:64:14
      2: core::panicking::panic
                at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/panicking.rs:114:5
      3: datafusion_common::scalar::ScalarValue::sub
                at ~/repo/arrow-datafusion/datafusion/common/src/scalar.rs:1816:9
      4: datafusion_common::scalar::ScalarValue::distance
                at ~/repo/arrow-datafusion/datafusion/common/src/scalar.rs:1885:13
      5: datafusion_physical_expr::expressions::binary::analyze_expr_scalar_comparison
                at ~/repo/arrow-datafusion/datafusion/physical-expr/src/expressions/binary.rs:861:57
      6: <datafusion_physical_expr::expressions::binary::BinaryExpr as datafusion_physical_expr::physical_expr::PhysicalExpr>::analyze
                at ~/repo/arrow-datafusion/datafusion/physical-expr/src/expressions/binary.rs:732:25
      7: <datafusion::physical_plan::filter::FilterExec as datafusion::physical_plan::ExecutionPlan>::statistics
                at ~/repo/arrow-datafusion/datafusion/core/src/physical_plan/filter.rs:183:28
      8: datafusion::physical_optimizer::aggregate_statistics::take_optimizable
                at ~/repo/arrow-datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs:127:37
      9: <datafusion::physical_optimizer::aggregate_statistics::AggregateStatistics as datafusion::physical_optimizer::optimizer::PhysicalOptimizerRule>::optimize
                at ~/repo/arrow-datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs:56:41
     10: datafusion::physical_plan::planner::DefaultPhysicalPlanner::optimize_internal
                at ~/repo/arrow-datafusion/datafusion/core/src/physical_plan/planner.rs:1794:24
     11: <datafusion::physical_plan::planner::DefaultPhysicalPlanner as datafusion::physical_plan::planner::PhysicalPlanner>::create_physical_plan::{{closure}}
                at ~/repo/arrow-datafusion/datafusion/core/src/physical_plan/planner.rs:427:17
     12: <core::pin::Pin<P> as core::future::future::Future>::poll
                at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/future/future.rs:125:9
     13: <datafusion::execution::context::DefaultQueryPlanner as datafusion::execution::context::QueryPlanner>::create_physical_plan::{{closure}}
                at ~/repo/arrow-datafusion/datafusion/core/src/execution/context.rs:1175:13
     14: <core::pin::Pin<P> as core::future::future::Future>::poll
                at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/future/future.rs:125:9
     15: datafusion::execution::context::SessionState::create_physical_plan::{{closure}}
                at ~/repo/arrow-datafusion/datafusion/core/src/execution/context.rs:1670:13
     16: datafusion::dataframe::DataFrame::create_physical_plan::{{closure}}
                at ~/repo/arrow-datafusion/datafusion/core/src/dataframe.rs:99:60
     17: datafusion::dataframe::DataFrame::collect::{{closure}}
                at ~/repo/arrow-datafusion/datafusion/core/src/dataframe.rs:663:47
     18: datafusion::dataframe::DataFrame::show::{{closure}}
                at ~/repo/arrow-datafusion/datafusion/core/src/dataframe.rs:681:37
     19: rust_sample::main::{{closure}}
                at ./src/main.rs:33:14
     20: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
                at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.26.0/src/runtime/park.rs:283:63
     21: tokio::runtime::coop::with_budget
                at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.26.0/src/runtime/coop.rs:107:5
     22: tokio::runtime::coop::budget
                at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.26.0/src/runtime/coop.rs:73:5
     23: tokio::runtime::park::CachedParkThread::block_on
                at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.26.0/src/runtime/park.rs:283:31
     24: tokio::runtime::context::BlockingRegionGuard::block_on
                at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.26.0/src/runtime/context.rs:315:13
     25: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
                at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.26.0/src/runtime/scheduler/multi_thread/mod.rs:66:9
     26: tokio::runtime::runtime::Runtime::block_on
                at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.26.0/src/runtime/runtime.rs:304:45
     27: rust_sample::main
                at ./src/main.rs:69:5
     28: core::ops::function::FnOnce::call_once
                at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/ops/function.rs:250:5
   ```
   
   It doesn't panic in release mode, but the wrong stats can cause sub-optimal plans to be picked.
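   
   To make the release-mode behaviour concrete, the sketch below mimics it with plain `i64` values. It assumes the default release profile, where overflow checks are off and integer subtraction wraps. `wrapping_distance` and `checked_distance` are hypothetical helpers written for this example, not code from DataFusion.

   ```rust
   /// Hypothetical helper: what `max - min` evaluates to when overflow
   /// checks are disabled (two's-complement wrapping).
   fn wrapping_distance(min: i64, max: i64) -> i64 {
       max.wrapping_sub(min)
   }

   /// Hypothetical helper: the same subtraction, but reporting overflow
   /// as None instead of panicking or wrapping.
   fn checked_distance(min: i64, max: i64) -> Option<i64> {
       max.checked_sub(min)
   }
   ```

   For the file in this report, `wrapping_distance(i64::MIN, i64::MAX)` is `-1`, so any ratio built on it is meaningless, while `checked_distance(i64::MIN, i64::MAX)` is `None`, which a caller could treat as "no estimate available".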

