GitHub user zhuqi-lucas added a comment to the discussion: how to run tpch
benchmark datafusion
This isn’t a bug in DataFusion so much as a mismatch with how the TPCH
benchmark runner expects your data to be laid out. By default it looks under
your --path for one directory per table (named exactly after the table), and
inside each directory it expects one or more Parquet files. What you have
today is a flat directory of files:
```text
/par/tpch/sf4-parquet/
├─ customer.parquet
├─ lineitem.parquet
├─ nation.parquet
├─ orders.parquet
├─ part.parquet
├─ partsupp.parquet
├─ region.parquet
└─ supplier.parquet
```
When it tries to read the table part, it literally does a list() on
/par/tpch/sf4-parquet/part (i.e. a directory), which doesn’t exist, hence the
“NotFound … path: …/part” error.
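To see which tables would hit this error before running the benchmark, a small check along these lines may help. It is only a sketch: `check_tpch_layout` is a hypothetical helper name, not part of DataFusion, and it only tests that a directory exists for each of the eight table names shown above. It does not inspect the Parquet files inside.

```shell
# Hypothetical helper (not part of DataFusion): report which TPC-H tables
# lack a per-table directory under the given data path.
check_tpch_layout() {
  # Default to the path from this discussion; pass another path as $1.
  data_dir="${1:-/par/tpch/sf4-parquet}"
  missing=0
  for tbl in customer lineitem nation orders part partsupp region supplier; do
    if [ -d "$data_dir/$tbl" ]; then
      echo "ok: $tbl/"
    else
      echo "missing directory: $tbl/"
      missing=1
    fi
  done
  # Exit status 0 only when every table directory is present.
  return "$missing"
}

check_tpch_layout /par/tpch/sf4-parquet || echo "layout needs fixing"
```

With the flat layout shown above, every table would be reported as missing.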
An easy way to fix it:
```shell
cd /par/tpch/sf4-parquet
for tbl in customer lineitem nation orders part partsupp region supplier; do
  mkdir -p "$tbl"
  mv "${tbl}.parquet" "$tbl/"
done
```
Or you can use the DataFusion benchmark script to generate the TPCH data:
https://github.com/apache/datafusion/blob/main/benchmarks/README.md
```shell
./bench.sh data tpch
```
GitHub link:
https://github.com/apache/datafusion/discussions/16598#discussioncomment-13603469