GitHub user zhuqi-lucas added a comment to the discussion: how to run tpch 
benchmark datafusion

It isn’t a bug in DataFusion so much as a mismatch with how the TPCH benchmark 
runner expects your data to be laid out. By default it looks under your `--path` 
for one directory per table (named exactly after the table) and expects one or 
more Parquet files inside each of those directories. What you have today is a 
flat directory of files:

```text
/par/tpch/sf4-parquet/
├─ customer.parquet
├─ lineitem.parquet
├─ nation.parquet
├─ orders.parquet
├─ part.parquet
├─ partsupp.parquet
├─ region.parquet
└─ supplier.parquet
```

When it tries to read the `part` table, it literally does a `list()` on 
`/par/tpch/sf4-parquet/part` (i.e. a directory), which doesn’t exist, hence the 
“NotFound … path: …/part” error.
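
If you want to see up front which tables would hit this error, a small check along these lines lists the table names that have no directory under a given data path. This is only a sketch: the function name `missing_tpch_dirs` is made up for illustration, and the table list is taken from the file listing above.

```shell
# Hypothetical helper: print each table name that has no directory under $1.
missing_tpch_dirs() {
  base="$1"
  for tbl in customer lineitem nation orders part partsupp region supplier; do
    # A flat file such as part.parquet does not count; only a directory named "part" does.
    [ -d "$base/$tbl" ] || echo "$tbl"
  done
}
```

Running `missing_tpch_dirs /par/tpch/sf4-parquet` on the flat layout above should print all eight table names, and nothing once every table has its own directory.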




An easy way to fix it:

```shell
cd /par/tpch/sf4-parquet

for tbl in customer lineitem nation orders part partsupp region supplier; do
  mkdir -p "$tbl"
  mv "${tbl}.parquet" "$tbl/"
done
```
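
If you would rather not hard-code the table names, the same move can be sketched as a function that derives each directory name from the file name. The name `nest_parquet_files` is hypothetical, and it assumes every `.parquet` file sitting directly under the given path is named after its table, as in the listing above.

```shell
# Hypothetical helper: move every <name>.parquet directly under $1 into $1/<name>/.
nest_parquet_files() {
  base="$1"
  for f in "$base"/*.parquet; do
    # Skip the unexpanded glob (no matches) and anything that is not a regular file.
    [ -f "$f" ] || continue
    tbl=$(basename "$f" .parquet)
    mkdir -p "$base/$tbl"
    mv "$f" "$base/$tbl/"
  done
}
```

Calling `nest_parquet_files /par/tpch/sf4-parquet` a second time is harmless, since no top-level `.parquet` files remain after the first run.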


Or you can use the DataFusion benchmark script to generate the TPCH data:

https://github.com/apache/datafusion/blob/main/benchmarks/README.md

```shell
./bench.sh data tpch
```



GitHub link: 
https://github.com/apache/datafusion/discussions/16598#discussioncomment-13603469
