pepijnve commented on code in PR #122:
URL: https://github.com/apache/datafusion-site/pull/122#discussion_r2732441232


##########
content/blog/2026-01-26-datafusion_case.md:
##########
@@ -0,0 +1,468 @@
+---
+layout: post
+title: Optimizing SQL CASE Expression Evaluation
+date: 2026-01-26
+author: Pepijn Van Eeckhoudt
+categories: [features]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+<style>
+figure {
+  margin: 20px 0;
+}
+
+figure img {
+  display: block;
+  max-width: 80%;
+  margin: auto;
+}
+
+figcaption {
+  font-style: italic;
+  color: #555;
+  font-size: 0.9em;
+  max-width: 80%;
+  margin: auto;
+  text-align: center;
+}
+</style>
+
+[Apache DataFusion]: https://datafusion.apache.org/
+
+SQL's `CASE` expression is one of the few explicit conditional evaluation 
constructs the language provides.
+It lets you control which expression from a set of expressions is evaluated 
for each row based on arbitrary boolean expressions.
+Its deceptively simple syntax hides significant implementation complexity.
+Over the past few releases, we've landed a series of improvements to [Apache 
DataFusion]'s `CASE` expression evaluator that reduce both CPU time and memory 
allocations.
+This post walks through the original implementation, its performance 
bottlenecks, and how we addressed them step by step.
+
+
+## Background: CASE Expression Evaluation
+
+SQL supports two forms of CASE expressions:
+
+1. **Simple**: `CASE expr WHEN value1 THEN result1 WHEN value2 THEN result2 
... END`
+2. **Searched**: `CASE WHEN condition1 THEN result1 WHEN condition2 THEN 
result2 ... END`
+
+The simple form evaluates an expression once for each input row and then tests 
that value against the expressions (typically constants) in each `WHEN` clause 
using equality comparisons.
+Think of it as a limited Rust `match` expression.
+
+Here's an example of the simple form:
+
+```sql
+CASE status
+    WHEN 'pending' THEN 1
+    WHEN 'active' THEN 2
+    WHEN 'complete' THEN 3
+    ELSE 0
+END
+```
+
+In this `CASE` expression, `status` is evaluated once per row, and then its 
value is tested for equality with the values `'pending'`, `'active'`, and 
`'complete'` in that order.
+The `CASE` expression evaluates to the value of the `THEN` expression 
corresponding to the first matching `WHEN` expression.
+
+The searched `CASE` form is a more flexible variant.
+It evaluates completely independent boolean expressions for each branch.
+This allows you to test different columns with different operators per branch 
as can be seen in the following example:
+
+```sql
+CASE
+    WHEN age > 65 THEN 'senior'
+    WHEN childCount != 0 THEN 'parent'
+    WHEN age < 21 THEN 'minor'
+    ELSE 'adult'
+END
+```
+
+In both forms, branches are evaluated sequentially with short-circuit 
semantics: for each row, once a `WHEN` condition matches, the corresponding 
`THEN` expression is evaluated.
+Any further branches are not evaluated for that row.
+This lazy evaluation model is critical for correctness.
+It lets you safely write `CASE` expressions like
+
+```sql
+CASE
+    WHEN denominator == 0 THEN NULL
+    ELSE nominator / denominator
+END
+```
+
+that are guaranteed to not trigger divide-by-zero errors.
+
+Besides `CASE`, there are a few [conditional scalar 
functions](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#conditional-functions)
 that provide similar, more restricted capabilities.
+These include `COALESCE`, `IFNULL`, and `NVL2`.
+
+Each of these functions can be seen as the equivalent of a macro for `CASE`.
+`COALESCE(expr1, expr2, expr3)` for instance, would expand to:
+
+```sql
+CASE
+  WHEN expr1 IS NOT NULL THEN expr1
+  WHEN expr2 IS NOT NULL THEN expr2
+  ELSE expr3
+END
+```
+
+[Apache DataFusion] implements these conditional functions by rewriting them 
to their equivalent `CASE` expression.
+As a consequence, any optimizations related to `CASE` described in this post 
also apply to conditional function evaluation.
+
+## `CASE` Evaluation in DataFusion 50.0.0
+
+For the remainder of this post we'll be looking at 'searched case' evaluation.
+'Simple case' uses a distinct, but very similar implementation.
+The same set of improvements has been applied to both.
+
+The baseline implementation in DataFusion 50.0.0 evaluated `CASE` using a 
straightforward approach:
+
+1. Start with an output array `out` with the same length as the input batch, 
filled with nulls. Additionally, create a bit vector `remainder` with the same 
length and each value set to `true`.
+2. For each `WHEN`/`THEN` branch:
+  - Evaluate the `WHEN` condition for remaining unmatched rows using 
[`PhysicalExpr::evaluate_selection`](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_selection),
 passing in the input batch and the `remainder` mask
+  - If any rows matched, evaluate the `THEN` expression for those rows using 
`PhysicalExpr::evaluate_selection`
+  - Merge the results into the `out` using the 
[`zip`](https://docs.rs/arrow/latest/arrow/compute/kernels/zip/fn.zip.html) 
kernel
+  - Update the `remainder` mask to exclude matched rows
+3. If there's an `ELSE` clause, evaluate it for any remaining unmatched rows 
and merge using 
[`zip`](https://docs.rs/arrow/latest/arrow/compute/kernels/zip/fn.zip.html)
+
+Here's a simplified version of the original loop:
+
+```rust
+let mut out = new_null_array(&return_type, batch.num_rows());
+let mut remainder = BooleanArray::from(vec![true; batch.num_rows()]);
+
+for (when_expr, then_expr) in &self.when_then_expr {
+    // Determine for which remaining rows the WHEN condition matches
+    let when = when_expr.evaluate_selection(batch, &remainder)?
+        .into_array(batch.num_rows())?;
+    // Ensure any `NULL` values are treated as false
+    let when_and_rem = and(&when, &remainder)?;
+
+    if when_and_rem.true_count() == 0 {
+        continue;
+    }
+
+    // Evaluate the THEN expression for matching rows
+    let then = then_expr.evaluate_selection(batch, &when_and_rem)?;
+    // Merge results into output array
+    out = zip(&when_and_rem, &then_value, &out)?;
+    // Update remainder mask to exclude matched rows
+    remainder = and_not(&remainder, &when_and_rem)?;
+}
+```
+
+Schematically, one iteration of this loop for the case expression
+
+```sql
+CASE
+    WHEN col = 'b' THEN 100
+    ELSE 200
+END
+```
+
+looks like this:
+
+<figure>
+<img src="/blog/images/case/original_loop.svg" alt="Schematic representation 
of data flow in the original CASE implementation" width="100%" 
class="img-responsive">
+<figcaption>One iteration of the `CASE` evaluation loop</figcaption>
+</figure>
+
+While correct, this implementation has significant room for optimization, 
mostly related to the usage of `evaluate_selection`.
+To understand why, we need to dig a little deeper into the implementation of 
that function.
+Here's a simplified version of it that captures the relevant parts:
+
+```rust
+pub trait PhysicalExpr {
+    fn evaluate_selection(
+        &self,
+        batch: &RecordBatch,
+        selection: &BooleanArray,
+    ) -> Result<ColumnarValue> {
+        // Reduce record batch to only include rows that match selection
+        let filtered_batch = filter_record_batch(batch, selection)?;
+        // Perform regular evaluation on filtered batch
+        let filtered_result = self.evaluate(&filtered_batch)?;
+        // Expand result array to match original batch length
+        scatter(selection, filtered_result)
+    }
+}
+```
+
+Going back to the same example as before, the data flow looks like this:
+
+<figure>
+<img src="/blog/images/case/evaluate_selection.svg" alt="Schematic 
representation of `evaluate_selection` evaluation" width="100%" 
class="img-responsive">
+<figcaption>evaluate_selection data flow</figcaption>
+</figure>
+
+The `evaluate_selection` method first filters the input batch to only include 
rows that match the `selection` mask.
+It then calls the regular `evaluate` method using the filtered batch as input.
+Finally, to return a result array with the same number of rows as `batch`, the 
`scatter` function is called.
+This function produces a new array padded with `null` values for any rows that 
didn't match the `selection` mask.
+
+So how does the simple evaluation strategy and use of `evaluate_selection` 
cause performance overhead?
+
+### Observation 1: No Early Exit
+
+The case evaluation loop always iterated through all branches, even when every 
row had already been matched.
+In queries where early branches match many rows, this meant unnecessary work 
was done for remaining rows.
+
+### Observation 2: Repeated Filtering, Scattering, and Merging
+
+Each iteration performed a number of operations that are very well-optimized, 
but still take up a significant amount of CPU time:
+
+- **Filtering**: `PhysicalExpr::evaluate_selection` filters the entire 
`RecordBatch` for each branch. For the `WHEN` expression, this was done even if 
the selection mask was entirely empty.
+- **Scattering**: `PhysicalExpr::evaluate_selection` scatters the filtered 
result back to the original `RecordBatch` length.
+- **Merging**: The `zip` kernel is called once per branch to merge partial 
results into the output array
+
+Each of these operations needs to allocate memory for new arrays and shuffle 
quite a bit of data around. 
+
+### Observation 3: Filtering Unused Columns
+
+The `PhysicalExpr::evaluate_selection` method filters the entire record batch, 
including columns that the current branch's `WHEN` and `THEN` expressions don't 
reference.
+For wide tables (many columns) with narrow expressions (few column 
references), this is wasteful.
+
+Suppose we have a table with 26 columns named `a` through `z`.
+For a simple CASE expression like:
+
+```sql
+CASE
+  WHEN a > 1000 THEN 'large'
+  WHEN a >= 0 THEN 'positive'
+  ELSE 'negative'
+END
+```
+
+the implementation would filter all 26 columns even though only a single 
column is needed for the entire `CASE` expression evaluation.
+Again this involves a non-negligible amount of allocation and data copying.
+
+## Performance Optimizations
+
+### Optimization 1: Short-Circuit Early Exit
+
+The first optimization is an easy one.
+As soon as we can detect that all rows of the batch have been matched we break 
out of the evaluation loop:
+
+```rust
+let mut remainder_count = batch.num_rows();
+
+for (when_expr, then_expr) in &self.when_then_expr {
+    if remainder_count == 0 {
+        break;  // All rows matched, exit early
+    }
+
+    // ... evaluate branch ...
+
+    let when_match_count = when_value.true_count();
+    remainder_count -= when_match_count;
+}
+```
+
+Additionally, we avoid evaluating the `ELSE` clause when no rows remain:
+
+```rust
+if let Some(else_expr) = &self.else_expr {
+    remainder = or(&base_nulls, &remainder)?;
+    if remainder.true_count() > 0 {
+        // ... evaluate else ...
+    }
+}
+```
+
+For queries where early branches match all rows, this eliminates unnecessary 
branch evaluations and `ELSE` clause processing.
+
+This optimization was implemented by Pepijn Van Eeckhoudt 
([`@pepijnve`](https://github.com/pepijnve)) in [PR 
#17898](https://github.com/apache/datafusion/pull/17898)
+
+### Optimization 2: Optimized Result Merging
+
+The second optimization fundamentally restructured how the results of each 
loop iteration are merged.
+The diagram below illustrates the optimized data flow when evaluating the 
`CASE WHEN col = 'b' THEN 100 ELSE 200 END` from before:
+
+<figure>

Review Comment:
   Thanks for spotting that. Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to