gruuya opened a new issue, #9006: URL: https://github.com/apache/arrow-datafusion/issues/9006
### Describe the bug There's an edge case in the present `max_distinct_count` algorithm, whereby there can be an attempt to subtract a potentially larger number of total nulls from a inexact smaller number of total rows to get the distinct values https://github.com/apache/arrow-datafusion/blob/bee7136a04c60a2c06caa630cf1b72f32f7dc574/datafusion/physical-plan/src/joins/utils.rs#L957-L959 This leads to a panic with `attempt to subtract with overflow`. ### To Reproduce Extract the three parquet files from [files.zip](https://github.com/apache/arrow-datafusion/files/14064122/files.zip) needed for the repro. These were generated using DuckDB with SF=0.01 for TPC-DS benchamrks. The example below is a minimal repro for an issue observed for query 24 from that benchmark. ```rust #[tokio::main] async fn main() -> Result<()> { let ctx = SessionContext::new(); let file_format = ParquetFormat::default().with_enable_pruning(Some(true)); let listing_options = ListingOptions::new(Arc::new(file_format)) .with_file_extension(FileType::PARQUET.get_ext()); ctx.register_listing_table( "store", &format!("/path/to/store.parquet"), listing_options.clone(), None, None, ) .await?; ctx.register_listing_table( "store_sales", &format!("/path/to/store_sales.parquet"), listing_options.clone(), None, None, ) .await?; ctx.register_listing_table( "customer", &format!("/path/to/customer.parquet"), listing_options, None, None, ) .await?; let df = ctx .sql( "SELECT c_last_name, c_first_name, s_store_name, s_state FROM store_sales, store, customer WHERE ss_customer_sk = c_customer_sk AND ss_store_sk = s_store_sk AND s_market_id=8", ) .await?; // print the results df.show().await?; Ok(()) } ``` The above code panics with: ```bash thread 'main' panicked at datafusion/physical-plan/src/joins/utils.rs:958:40: attempt to subtract with overflow ``` Note that you can get a repro with the cli by appending `DATAFUSION_EXECUTION_COLLECT_STATISTICS=true DATAFUSION_EXECUTION_TARGET_PARTITIONS=1` to `cargo run` ### Expected behavior The example shouldn't panic, but instead return an empty result. ### Additional context As for the question how does this situation even occur in the first place, from my brief investigation I'm seeing that: 1. The `FilterExec` for the filtering on the `store` table predicate returns `Inexact(0)` as the number of rows for it's output statistics, since the predicate refutes all the input rows (in the case of `store` above there's only a single row with `s_market_id` equals to 2). 2. When joining `store` and `store_sales` the join cardinality estimate is 0 due to the above filtering, but the column statistic are nonetheless merged as is (meaning an exact null count for the `store_sales` columns is inherited) https://github.com/apache/arrow-datafusion/blob/bee7136a04c60a2c06caa630cf1b72f32f7dc574/datafusion/physical-plan/src/joins/utils.rs#L848-L859 3. Finally during merging with `customer` the statistics from step 2 enters into play, and when it reaches `max_distinct_count` it hits the `num_rows` being `Inexact(0)` but `stats.null_count` being exact and greater than zero edge case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
