gruuya opened a new issue, #9006:
URL: https://github.com/apache/arrow-datafusion/issues/9006

   ### Describe the bug
   
   There's an edge case in the present `max_distinct_count` algorithm, whereby 
there can be an attempt to subtract a potentially larger number of total nulls 
from a inexact smaller number of total rows to get the distinct values
   
https://github.com/apache/arrow-datafusion/blob/bee7136a04c60a2c06caa630cf1b72f32f7dc574/datafusion/physical-plan/src/joins/utils.rs#L957-L959
   
   This leads to a panic with `attempt to subtract with overflow`.
   
   ### To Reproduce
   
   Extract the three parquet files from 
[files.zip](https://github.com/apache/arrow-datafusion/files/14064122/files.zip)
 needed for the repro. These were generated using DuckDB with SF=0.01 for 
TPC-DS benchamrks. The example below is a minimal repro for an issue observed 
for query 24 from that benchmark.
   
   ```rust
   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();
   
       let file_format = 
ParquetFormat::default().with_enable_pruning(Some(true));
       let listing_options = ListingOptions::new(Arc::new(file_format))
           .with_file_extension(FileType::PARQUET.get_ext());
   
       ctx.register_listing_table(
           "store",
           &format!("/path/to/store.parquet"),
           listing_options.clone(),
           None,
           None,
       )
       .await?;
       ctx.register_listing_table(
           "store_sales",
           &format!("/path/to/store_sales.parquet"),
           listing_options.clone(),
           None,
           None,
       )
           .await?;
       ctx.register_listing_table(
           "customer",
           &format!("/path/to/customer.parquet"),
           listing_options,
           None,
           None,
       )
           .await?;
   
       let df = ctx
           .sql(
               "SELECT c_last_name,
          c_first_name,
          s_store_name,
          s_state
   FROM store_sales,
        store,
        customer
   WHERE ss_customer_sk = c_customer_sk
     AND ss_store_sk = s_store_sk
     AND s_market_id=8",
           )
           .await?;
   
       // print the results
       df.show().await?;
   
       Ok(())
   }
   ```
   
   The above code panics with:
   ```bash
   thread 'main' panicked at datafusion/physical-plan/src/joins/utils.rs:958:40:
   attempt to subtract with overflow
   ```
   
   Note that you can get a repro with the cli by appending 
`DATAFUSION_EXECUTION_COLLECT_STATISTICS=true 
DATAFUSION_EXECUTION_TARGET_PARTITIONS=1` to `cargo run`
   
   ### Expected behavior
   
   The example shouldn't panic, but instead return an empty result.
   
   ### Additional context
   
   As for the question how does this situation even occur in the first place, 
from my brief investigation I'm seeing that:
   
   1. The `FilterExec` for the filtering on the `store` table predicate returns 
`Inexact(0)` as the number of rows for it's output statistics, since the 
predicate refutes all the input rows (in the case of `store` above there's only 
a single row with `s_market_id` equals to 2).
   2. When joining `store` and `store_sales` the join cardinality estimate is 0 
due to the above filtering, but the column statistic are nonetheless merged as 
is (meaning an exact null count for the `store_sales` columns is inherited)
   
https://github.com/apache/arrow-datafusion/blob/bee7136a04c60a2c06caa630cf1b72f32f7dc574/datafusion/physical-plan/src/joins/utils.rs#L848-L859
   3. Finally during merging with `customer` the statistics from step 2 enters 
into play, and when it reaches `max_distinct_count` it hits the `num_rows` 
being `Inexact(0)` but `stats.null_count` being exact and greater than zero 
edge case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to