andygrove opened a new issue, #1901:
URL: https://github.com/apache/datafusion-ballista/issues/1901

   ## Describe the bug
   
   Client-set DataFusion session config (the `datafusion.*` namespace) is 
ignored by scheduler-side query planning, while `ballista.*` settings in the 
same session do take effect.
   
   ## To Reproduce
   
   Run any TPC-H query with adaptive query execution (AQE) enabled and a 
`datafusion.optimizer` override, e.g. via the tpch benchmark:
   
   ```
   cargo run --release --bin tpch -- benchmark ballista \
     --host localhost --port 50050 \
     --query 5 --path <data> --format parquet --partitions 8 \
     -c ballista.planner.adaptive.enabled=true \
     -c datafusion.optimizer.hash_join_single_partition_threshold=10485760 \
     -c datafusion.optimizer.hash_join_single_partition_threshold_rows=1000000
   ```
   
   ## Expected behavior
   
   The AQE join resolver (`DynamicJoinSelectionExec::to_actual_join`) should 
see the overridden threshold and promote small build sides to `CollectLeft`.
   
   ## Actual behavior
   
   The resolver still sees the threshold as `0`, so every join resolves to 
`Partitioned` and none to `CollectLeft`, even for single-row build sides. The 
`datafusion.*` override has no effect. In the **same command**, the 
`ballista.planner.adaptive.enabled=true` override **does** take effect (AQE 
engages), so session settings are reaching the scheduler; only the 
`datafusion.*` ones are being dropped.
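
Why a threshold of `0` rules out `CollectLeft` can be illustrated with a simplified model. This is a sketch, not the code in `DynamicJoinSelectionExec::to_actual_join`: `JoinMode`, `choose_join_mode`, and the strict less-than comparison are assumptions made for illustration. Only the two threshold settings come from the command above.

```rust
/// Join modes named in this issue.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum JoinMode {
    CollectLeft,
    Partitioned,
}

/// Hypothetical, simplified stand-in for the resolver's decision.
/// Assumption: the build side is collected only when a known statistic
/// is strictly below its threshold. The real check may differ in detail.
pub fn choose_join_mode(
    build_bytes: Option<usize>,
    build_rows: Option<usize>,
    threshold_bytes: usize,
    threshold_rows: usize,
) -> JoinMode {
    // Unknown statistics never qualify the build side as "small".
    let small_by_bytes = build_bytes.map_or(false, |b| b < threshold_bytes);
    let small_by_rows = build_rows.map_or(false, |r| r < threshold_rows);
    if small_by_bytes || small_by_rows {
        JoinMode::CollectLeft
    } else {
        JoinMode::Partitioned
    }
}
```

Under this model no unsigned size is below `0`, so with both thresholds at `0` even a one-row build side stays `Partitioned`. With the overridden values (`10485760` bytes, `1000000` rows) the same build side would be collected.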
   
   ## Additional context
   
   The obvious path looks correct end-to-end:
   
   - `SessionConfigHelperExt::to_key_value_pairs` 
(ballista/core/src/extension.rs) serializes **all** config entries, including 
`datafusion.*`.
   - `create_update_session` (ballista/scheduler/src/scheduler_server/grpc.rs) 
calls `produce_config()` then `update_from_key_value_pair(&settings)`, applying 
the client overrides on top of the scheduler base config.
   
   So the override reaches the stored session config, but the value appears to 
be lost when the per-job planning `SessionState` is built. The suspicion is 
that Ballista's session defaults (`upgrade_for_ballista`, which sets 
`hash_join_single_partition_threshold = 0`) are re-applied over the client 
config. `ballista.*` settings survive because they live in the 
`BallistaConfig` extension rather than in `ConfigOptions.optimizer`. (This 
cause is suspected and has not yet been traced to a specific line.)
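
The suspected ordering problem can be modeled with plain maps. This is a sketch of the hypothesis only, not Ballista code: `Config`, `apply_ballista_defaults` (a stand-in for what `upgrade_for_ballista` is described as doing), `apply_client_overrides`, `build_suspected`, and `build_expected` are all hypothetical names. In this model the `ballista.*` key survives simply because the defaults step never writes it, which is a simplification of the extension-based explanation above.

```rust
use std::collections::HashMap;

/// Hypothetical flat key/value view of a session config.
pub type Config = HashMap<String, String>;

pub const THRESHOLD_KEY: &str =
    "datafusion.optimizer.hash_join_single_partition_threshold";

/// Stand-in for the Ballista session defaults described in this issue:
/// forces the single-partition threshold to 0.
pub fn apply_ballista_defaults(config: &mut Config) {
    config.insert(THRESHOLD_KEY.to_string(), "0".to_string());
}

/// Stand-in for applying client-provided key/value pairs.
pub fn apply_client_overrides(config: &mut Config, settings: &[(&str, &str)]) {
    for (key, value) in settings {
        config.insert(key.to_string(), value.to_string());
    }
}

/// Suspected (unconfirmed) order: defaults are applied last,
/// so they overwrite the client's `datafusion.*` value.
pub fn build_suspected(settings: &[(&str, &str)]) -> Config {
    let mut config = Config::new();
    apply_client_overrides(&mut config, settings);
    apply_ballista_defaults(&mut config);
    config
}

/// Expected order: defaults first, client overrides on top.
pub fn build_expected(settings: &[(&str, &str)]) -> Config {
    let mut config = Config::new();
    apply_ballista_defaults(&mut config);
    apply_client_overrides(&mut config, settings);
    config
}
```

If the hypothesis holds, a fix would amount to moving from the first ordering to the second, or to applying the defaults only to keys the client has not set.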
   
   **Impact:** users cannot tune DataFusion optimizer behavior per session; 
changing these thresholds currently requires recompiling the defaults (see 
#1770 / PR #1900).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

