adriangb opened a new pull request, #22836:
URL: https://github.com/apache/datafusion/pull/22836
## Which issue does this PR close?
- Relates to #21968 (wide-schema benchmark work).
## Rationale for this change
Every `sql_benchmarks` suite declares its external-table `LOCATION` as
`${DATA_DIR:-data}/...`, and the matching `bench.sh run_*` forwards
`DATA_DIR="${DATA_DIR}"` to the benchmark process — e.g. `clickbench`, `imdb`,
`tpch`, `push_down_topk`, `h2o` all do this.
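As a rough illustration of that convention, the sketch below uses the shell's own `${VAR:-default}` expansion, which has the same fallback syntax as the `LOCATION` placeholder. It assumes the SQL harness resolves the placeholder the same way. The helper name `resolve_location` and the `/tmp/staged-data` path are hypothetical and do not come from the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bench.sh): shows how a
# `${DATA_DIR:-data}/<suite>` style location resolves under shell rules.
resolve_location() {
    local subdir="$1"
    printf '%s/%s\n' "${DATA_DIR:-data}" "$subdir"
}

# DATA_DIR unset: falls back to the CWD-relative `data` directory.
unset DATA_DIR
resolve_location "wide_schema"

# DATA_DIR set: the externally staged dataset location is used instead.
# The path is an illustrative assumption.
DATA_DIR=/tmp/staged-data
resolve_location "wide_schema"
```

In the shell, `:-` also falls back when the variable is set but empty. Whether the harness treats an empty `DATA_DIR` the same way is not stated here.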
`wide_schema` (added in #21970) diverged on **both** counts:
- `wide_schema/init/load.sql` hardcoded `LOCATION 'data/wide_schema/...'`
(no `${DATA_DIR}`)
- `run_wide_schema` did not forward `DATA_DIR`
Run from inside the repo, this happens to work because the process CWD is
the repo's `benchmarks/` directory and the relative `data/` path resolves.
It breaks whenever the data directory is not under the current directory.
For example, the CI benchmark runner builds and runs the benchmark from a
separate source checkout while staging the dataset elsewhere and pointing at
it via `DATA_DIR`. There, `wide_schema` fails at load:
```
initialization failed: No files found at
file:///.../benchmarks/data/wide_schema/wide/.
Cannot infer schema from an empty location; either add data files or declare
an explicit schema for the table.
```
(The `bench.sh` variable `DATA_DIR` is set but not exported, so it only
reaches the benchmark when a `run_*` function passes it explicitly — which
`run_tpch` etc. do and `run_wide_schema` did not.)
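The set-but-not-exported behavior can be reproduced in isolation. This is a minimal sketch, not the actual `bench.sh` code: the child `sh -c` command stands in for the benchmark process, and the `/tmp/staged-data` path is an illustrative assumption.

```shell
#!/usr/bin/env bash
# Command run in a child process; it prints the DATA_DIR it inherited.
# This stands in for the benchmark binary and is purely illustrative.
show='printf "%s\n" "${DATA_DIR:-<not set>}"'

unset DATA_DIR
DATA_DIR=/tmp/staged-data   # set in this shell, but not exported

# Plain invocation: the child does not inherit the unexported variable.
not_forwarded=$(sh -c "$show")

# Per-command assignment: DATA_DIR is placed in the child's environment
# for this one invocation, mirroring the DATA_DIR="${DATA_DIR}" pattern.
forwarded=$(DATA_DIR="${DATA_DIR}" sh -c "$show")

echo "without forwarding: $not_forwarded"
echo "with forwarding:    $forwarded"
```

The per-command assignment leaves the parent shell's variable unexported, so each `run_*` function has to forward it explicitly.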
## What changes are included in this PR?
Two one-line fixes bringing `wide_schema` in line with the convention used
by every other suite:
- `load.sql`: `data/wide_schema/...` → `${DATA_DIR:-data}/wide_schema/...`
- `run_wide_schema`: forward `DATA_DIR="${DATA_DIR}"` for both the `wide`
and `narrow` subgroups
## Are these changes tested?
Benchmark tooling only. The substitution path is the same one every other
suite already relies on: the SQL harness's `process_replacements` resolves
`${DATA_DIR:-data}` from the env (falling back to `data`), and
`run_wide_schema` now forwards `DATA_DIR` exactly as `run_tpch` does. Verified
`bench.sh` still parses (`bash -n`). Default in-repo behavior is unchanged
(CWD-relative `data/`).
## Are there any user-facing changes?
No. Benchmark harness only.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]