adriangb opened a new pull request, #22836:
URL: https://github.com/apache/datafusion/pull/22836

   ## Which issue does this PR close?
   
   - Relates to #21968 (wide-schema benchmark work).
   
   ## Rationale for this change
   
   Every other `sql_benchmarks` suite declares its external-table `LOCATION` as 
`${DATA_DIR:-data}/...`, and the matching `bench.sh run_*` function forwards 
`DATA_DIR="${DATA_DIR}"` to the benchmark process. `clickbench`, `imdb`, 
`tpch`, `push_down_topk`, and `h2o` all do this.
   
   `wide_schema` (added in #21970) diverged on **both** counts:
   - `wide_schema/init/load.sql` hardcoded `LOCATION 'data/wide_schema/...'` 
(no `${DATA_DIR}`)
   - `run_wide_schema` did not forward `DATA_DIR`
   
   When run from inside the repo this happens to work, because the process CWD 
is the repo's `benchmarks/` and the relative `data/` path resolves. But it 
breaks whenever the data directory is not under the current directory. For 
example, the CI benchmark runner builds and runs the benchmark from a separate 
source checkout while staging the dataset elsewhere and pointing at it via 
`DATA_DIR`. There, `wide_schema` fails at load:
   
   ```
   initialization failed: No files found at 
file:///.../benchmarks/data/wide_schema/wide/.
   Cannot infer schema from an empty location; either add data files or declare 
an explicit schema for the table.
   ```
   
   (The `bench.sh` variable `DATA_DIR` is set but not exported, so it only 
reaches the benchmark when a `run_*` function passes it explicitly — which 
`run_tpch` etc. do and `run_wide_schema` did not.)
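
The set-but-not-exported behavior described above can be reproduced in isolation. This is a minimal sketch, not code from `bench.sh`: the function names `run_without_forwarding` and `run_with_forwarding` and the path `/tmp/staged-data` are hypothetical, and `bash -c` stands in for the benchmark process.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins; these are NOT the real bench.sh functions.

# Start clean in case the caller's environment already exported DATA_DIR.
unset DATA_DIR
DATA_DIR="/tmp/staged-data"   # set in this shell, deliberately not exported

# Child process started without forwarding: DATA_DIR is absent from its
# environment, so the child falls back to "data".
run_without_forwarding() {
    bash -c 'echo "${DATA_DIR:-data}"'
}

# Child process started with a per-command assignment, the same shape as
# writing DATA_DIR="${DATA_DIR}" in front of the benchmark command.
run_with_forwarding() {
    DATA_DIR="${DATA_DIR}" bash -c 'echo "${DATA_DIR:-data}"'
}

run_without_forwarding   # prints: data
run_with_forwarding      # prints: /tmp/staged-data
```

The per-command assignment places the variable in the child's environment only, without exporting it for the rest of the script.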
   
   ## What changes are included in this PR?
   
   Two one-line fixes bringing `wide_schema` in line with the convention used 
by every other suite:
   - `load.sql`: `data/wide_schema/...` → `${DATA_DIR:-data}/wide_schema/...`
   - `run_wide_schema`: forward `DATA_DIR="${DATA_DIR}"` for both the `wide` 
and `narrow` subgroups
   
   ## Are these changes tested?
   
   Benchmark tooling only. The substitution path is the same one every other 
suite already relies on: the SQL harness's `process_replacements` resolves 
`${DATA_DIR:-data}` from the environment (falling back to `data`), and 
`run_wide_schema` now forwards `DATA_DIR` exactly as `run_tpch` does. Verified 
that `bench.sh` still parses (`bash -n`). Default in-repo behavior is unchanged 
(CWD-relative `data/`).
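
As an illustration of the fallback, the shell's own `${VAR:-default}` expansion behaves as sketched below. This shows the shell semantics only, not the harness's `process_replacements`. The helper `resolve_location` and the path `/mnt/staged` are hypothetical, and whether the harness treats an empty `DATA_DIR` the same way as an unset one is an assumption not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper; NOT the harness's process_replacements.
# In the shell, ${DATA_DIR:-data} yields $DATA_DIR when it is set and
# non-empty, and the literal "data" otherwise.
resolve_location() {
    echo "${DATA_DIR:-data}/wide_schema/wide"
}

# Default in-repo case: no DATA_DIR, so the path stays CWD-relative.
(unset DATA_DIR; resolve_location)        # prints: data/wide_schema/wide

# CI-style case: the dataset is staged elsewhere.
(DATA_DIR=/mnt/staged; resolve_location)  # prints: /mnt/staged/wide_schema/wide
```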
   
   ## Are there any user-facing changes?
   
   No. Benchmark harness only.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

