kosiew commented on code in PR #20664:
URL: https://github.com/apache/datafusion/pull/20664#discussion_r2894330721
##########
datafusion/core/benches/sql_planner_extended.rs:
##########
@@ -212,21 +213,143 @@ fn build_test_data_frame(ctx: &SessionContext, rt:
&Runtime) -> DataFrame {
})
}
-fn criterion_benchmark(c: &mut Criterion) {
+/// Build a CASE-heavy dataframe over a non-inner join to stress
+/// planner-time filter pushdown and nullability/type inference.
+fn build_case_heavy_left_join_df(ctx: &SessionContext, rt: &Runtime) ->
DataFrame {
+ register_string_table(ctx, 100, 1000);
+ let query = build_case_heavy_left_join_query(30, 1);
+ rt.block_on(async { ctx.sql(&query).await.unwrap() })
+}
+
+fn build_case_heavy_left_join_query(predicate_count: usize, case_depth: usize)
-> String {
+ let mut query = String::from(
+ "SELECT l.c0, r.c0 AS rc0 FROM t l LEFT JOIN t r ON l.c0 = r.c0 WHERE
",
+ );
+
+ if predicate_count == 0 {
+ query.push_str("TRUE");
+ return query;
+ }
+
+ // Keep this deterministic so comparisons between profiles are stable.
+ for i in 0..predicate_count {
+ if i > 0 {
+ query.push_str(" AND ");
+ }
+
+ let mut expr = format!("length(l.c{})", i % 20);
Review Comment:
> don't think we really want a benchmark for the case expression.
We want to optimize for the evaluation cost of the filter during pushdown,
so perhaps it could be written not using a large case expression as is done
currently or adaptive removing filters, etc.
I believe that you fear that by using a huge CASE expression we might be
tuning for an unrealistic “case expression” workload instead of the more common
cost of pushing filters through joins.
The benchmark uses CASE because that form triggered a profiler hotspot in
`PushDownFilter` — the nullability/type‑inference codepath for filters on
non‑inner joins. I don’t believe real‑world queries typically look like this,
so the presence of CASE is purely a convenient way to exercise that particular
expensive planner path, not the target of optimization.
To make this clear and avoid overfitting, I’m going to treat the CASE
variant as a narrowly scoped micro‑benchmark and add a companion LEFT JOIN
query with a simple predicate instead of a CASE.
With both in place we can separate:
1. the baseline cost of pushing a filter through a join, and
2. the extra work incurred when a CASE expression forces nullability
inference.
That way the benchmark remains useful for optimization while still
reflecting more general planner behaviour.
> So the TPC-H/TPC-DS one is already a good one to optimize for.
Agreed. TPC-H/TPC-DS should remain the primary macro-level signal for
optimization value and regression detection.
The intent here is to complement those suites with a deterministic
micro-benchmark that isolates one known planner hotspot; macro benchmarks are
still required to verify end-to-end relevance and prevent narrow wins.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]