kosiew commented on code in PR #20664:
URL: https://github.com/apache/datafusion/pull/20664#discussion_r2894330721


##########
datafusion/core/benches/sql_planner_extended.rs:
##########
@@ -212,21 +213,143 @@ fn build_test_data_frame(ctx: &SessionContext, rt: 
&Runtime) -> DataFrame {
     })
 }
 
-fn criterion_benchmark(c: &mut Criterion) {
+/// Build a CASE-heavy dataframe over a non-inner join to stress
+/// planner-time filter pushdown and nullability/type inference.
+fn build_case_heavy_left_join_df(ctx: &SessionContext, rt: &Runtime) -> 
DataFrame {
+    register_string_table(ctx, 100, 1000);
+    let query = build_case_heavy_left_join_query(30, 1);
+    rt.block_on(async { ctx.sql(&query).await.unwrap() })
+}
+
+fn build_case_heavy_left_join_query(predicate_count: usize, case_depth: usize) 
-> String {
+    let mut query = String::from(
+        "SELECT l.c0, r.c0 AS rc0 FROM t l LEFT JOIN t r ON l.c0 = r.c0 WHERE 
",
+    );
+
+    if predicate_count == 0 {
+        query.push_str("TRUE");
+        return query;
+    }
+
+    // Keep this deterministic so comparisons between profiles are stable.
+    for i in 0..predicate_count {
+        if i > 0 {
+            query.push_str(" AND ");
+        }
+
+        let mut expr = format!("length(l.c{})", i % 20);

Review Comment:
   >  don't think we really want a benchmark for the case expression.
   We want to optimize for the evaluation cost of the filter during pushdown, 
so perhaps it could be written not using a large case expression as is done 
currently or adaptive removing filters, etc.
   
   I believe that you fear that by using a huge CASE expression we might be 
tuning for an unrealistic “case expression” workload instead of the more common 
cost of pushing filters through joins. 
   
   The benchmark uses CASE because that form triggered a profiler hotspot in 
`PushDownFilter` — the nullability/type‑inference codepath for filters on 
non‑inner joins. I don’t believe real‑world queries typically look like this, 
so the presence of CASE is purely a convenient way to exercise that particular 
expensive planner path, not the target of optimization.
   
   To make this clear and avoid overfitting, I’m going to treat the CASE 
variant as a narrowly scoped micro‑benchmark and add a companion LEFT JOIN 
query with a simple predicate instead of a CASE. 
   With both in place we can separate:
   
   1. the baseline cost of pushing a filter through a join, and  
   2. the extra work incurred when a CASE expression forces nullability 
inference.
   
   That way the benchmark remains useful for optimization while still 
reflecting more general planner behaviour.
   
   > So the TPC-H/TPC-DS one is already a good one to optimize for.
   
   Agreed. TPC-H/TPC-DS should remain the primary macro-level signal for 
optimization value and regression detection.
   
   The intent here is to complement those suites with a deterministic 
micro-benchmark that isolates one known planner hotspot; macro benchmarks are 
still required to verify end-to-end relevance and prevent narrow wins.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to