xanderbailey opened a new pull request, #20875:
URL: https://github.com/apache/datafusion/pull/20875

   ## Which issue does this PR close?
   
   N/A — preventive measure inspired by #20683 / #20672 where `RepartitionExec` 
silently dropped rows when spilling under memory pressure.
   
   ## Rationale for this change
   
   The bug fixed in #20672 (repartition dropping data when spilling) was a 
silent correctness issue — queries returned wrong results with no error. This 
class of bug is particularly dangerous because it can go undetected.
   
   DataFusion operators already declare their expected cardinality effect via 
`CardinalityEffect::Equal` (meaning "I output exactly as many rows as I 
receive"), but nothing actually verified this at runtime. This PR adds a 
post-execution sanity check that catches any such violation, so bugs like 
#20683 can be detected immediately rather than silently producing wrong results.
   
   ## What changes are included in this PR?
   
   1. **New session config flag** 
`datafusion.execution.verify_cardinality_effect` (default: `false`) in 
`ExecutionOptions`
   2. **New `cardinality_check` module** in `datafusion-physical-plan` with a 
`validate_cardinality_effect()` function that walks the executed plan tree 
using `ExecutionPlanVisitor` and verifies that every operator declaring 
`CardinalityEffect::Equal` produced the same number of output rows as its input 
(based on post-execution metrics)
   3. **Hooks in `collect()` and `collect_partitioned()`** that run the 
validation after all streams are consumed, when the config flag is enabled
   4. **`#[derive(Debug, Clone, Copy)]` on `CardinalityEffect`** — the enum was 
missing these basic derives
   
   The check gracefully skips operators where metrics are unavailable, where 
`fetch` limits are set, or where the operator is not unary (one in - one out). 
When a violation is detected, it returns `DataFusionError::Internal` with the 
operator name and mismatched row counts.
   
   ## Are these changes tested?
   
   Yes. Six unit tests cover:
   - Equal cardinality passes (matching row counts)
   - Equal cardinality violation detected (mismatched row counts)
   - Skips operators with `fetch` set
   - Skips operators without metrics
   - Skips non-Equal operators
   - Detects violations in nested plan trees
   
   ## Are there any user-facing changes?
   
   New config option `datafusion.execution.verify_cardinality_effect` (default 
`false`). When enabled via `SET execution.verify_cardinality_effect = true`, 
DataFusion will validate row count invariants after query execution and return 
an error if any operator that should preserve row counts (like 
`RepartitionExec`, `ProjectionExec`, `CoalesceBatchesExec`, `SortExec`, etc.) 
produces a different number of rows than its input.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to