andygrove opened a new issue, #1901:
URL: https://github.com/apache/datafusion-ballista/issues/1901
## Describe the bug
Client-set DataFusion session config (the `datafusion.*` namespace) is
ignored by scheduler-side query planning, while `ballista.*` settings in the
same session do take effect.
## To Reproduce
Run any TPC-H query with AQE enabled and a `datafusion.optimizer` override,
e.g. via the tpch benchmark:
```
cargo run --release --bin tpch -- benchmark ballista \
--host localhost --port 50050 \
--query 5 --path <data> --format parquet --partitions 8 \
-c ballista.planner.adaptive.enabled=true \
-c datafusion.optimizer.hash_join_single_partition_threshold=10485760 \
-c datafusion.optimizer.hash_join_single_partition_threshold_rows=1000000
```
## Expected behavior
The AQE join resolver (`DynamicJoinSelectionExec::to_actual_join`) should
see the overridden threshold and promote small build sides to `CollectLeft`.
## Actual behavior
The resolver still sees the threshold as `0`: every join resolves to
`Partitioned` and none to `CollectLeft`, even for single-row build sides. The
`datafusion.*` override has no effect. In the **same command**, the
`ballista.planner.adaptive.enabled=true` override **does** take effect (AQE
engages), so the session settings are reaching the scheduler; only the
`datafusion.*` ones are being dropped.
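
To illustrate why a threshold of `0` rules out `CollectLeft` entirely, here is a minimal toy sketch. It is not the actual `DynamicJoinSelectionExec::to_actual_join` code: the `JoinMode` enum and `resolve_join_mode` function are hypothetical names, the strict less-than comparison is an assumption, and the row-count threshold is omitted for brevity.

```rust
/// Hypothetical stand-in for the join modes named in this issue.
#[derive(Debug, PartialEq)]
enum JoinMode {
    CollectLeft,
    Partitioned,
}

/// Toy resolver: promote to CollectLeft only when the build side is
/// strictly smaller than the byte threshold (assumed comparison).
fn resolve_join_mode(build_side_bytes: u64, threshold_bytes: u64) -> JoinMode {
    if build_side_bytes < threshold_bytes {
        JoinMode::CollectLeft
    } else {
        JoinMode::Partitioned
    }
}
```

Under these assumptions, `resolve_join_mode(64, 0)` yields `Partitioned` no matter how small the build side is, while `resolve_join_mode(64, 10_485_760)` yields `CollectLeft`, which is the behavior the override in the command above is meant to enable.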
## Additional context
The obvious path looks correct end-to-end:
- `SessionConfigHelperExt::to_key_value_pairs`
(ballista/core/src/extension.rs) serializes **all** config entries, including
`datafusion.*`.
- `create_update_session` (ballista/scheduler/src/scheduler_server/grpc.rs)
calls `produce_config()` then `update_from_key_value_pair(&settings)`, applying
the client overrides on top of the scheduler base config.
So the override reaches the stored session config, but the value appears to
be lost when the per-job planning `SessionState` is built. The suspicion is
that Ballista's session defaults (`upgrade_for_ballista`, which sets
`hash_join_single_partition_threshold = 0`) are re-applied over the client
config. `ballista.*` extension config survives because it lives in the
`BallistaConfig` extension rather than in `ConfigOptions.optimizer`. This is a
suspected cause only; it has not yet been traced to a specific line.
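
The suspected ordering problem can be modelled with a small standalone sketch. This is a toy using a plain `HashMap`, not Ballista's or DataFusion's real types; `Settings`, `apply_ballista_defaults`, `apply_client_overrides`, and the two `build_planning_config_*` functions are all hypothetical names, and the toy keeps `ballista.*` keys in the same map rather than in a separate extension.

```rust
use std::collections::HashMap;

/// Hypothetical flat key/value stand-in for a session config.
type Settings = HashMap<String, String>;

const THRESHOLD_KEY: &str =
    "datafusion.optimizer.hash_join_single_partition_threshold";

/// Models the Ballista defaults, which (per this issue) set the threshold to 0.
fn apply_ballista_defaults(cfg: &mut Settings) {
    cfg.insert(THRESHOLD_KEY.to_string(), "0".to_string());
}

/// Models applying client-supplied key/value pairs on top of a config.
fn apply_client_overrides(cfg: &mut Settings, overrides: &[(&str, &str)]) {
    for (k, v) in overrides {
        cfg.insert(k.to_string(), v.to_string());
    }
}

/// Suspected behavior: defaults are applied again after the client overrides,
/// so any key the defaults touch loses the client's value.
fn build_planning_config_suspected(overrides: &[(&str, &str)]) -> Settings {
    let mut cfg = Settings::new();
    apply_ballista_defaults(&mut cfg);
    apply_client_overrides(&mut cfg, overrides);
    apply_ballista_defaults(&mut cfg); // clobbers the datafusion.* override
    cfg
}

/// Expected behavior: client overrides are the last writer.
fn build_planning_config_expected(overrides: &[(&str, &str)]) -> Settings {
    let mut cfg = Settings::new();
    apply_ballista_defaults(&mut cfg);
    apply_client_overrides(&mut cfg, overrides);
    cfg
}
```

In this model, passing both `ballista.planner.adaptive.enabled=true` and a threshold override to `build_planning_config_suspected` leaves the `ballista.*` key intact but resets the threshold to `0`, matching the symptom described above. If the real cause turns out to be different, the sketch only shows why "last writer wins" ordering would produce this asymmetry.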
**Impact:** users cannot tune DataFusion optimizer behavior per session;
changing these thresholds currently requires recompiling the defaults (see
#1770 / PR #1900).