kevinjqliu commented on issue #14608:
URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2816875067

   > In order to try and make progress on this, I decided to go with having a 
single function that builds all tables for a single scale factor similar to how 
DuckDB does it. My reasoning is that this approach is both simpler to implement 
and is more useful (benchmark queries need multiple tables). 
   
   +1, that's a good place to start!
   
   
   I want to add some context to the single-function (`CALL dbgen(sf = 1);`) vs. table-function (`tpch.lineitem(1)`) discussion.
   
   DuckDB generates all 8 TPC-H tables with a single function, but the result is stored internally as DuckDB tables. DuckDB's [`EXPORT DATABASE`](https://duckdb.org/docs/stable/sql/statements/export.html) statement can then write all 8 tables out as files, and it offers several options for configuring the output, such as file format, compression, and row group size.
   I can also save individual tables with [`COPY`](https://duckdb.org/docs/stable/sql/statements/copy), but I would still have wasted CPU cycles generating the other tables.
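   To make the comparison concrete, this is roughly what the DuckDB workflow looks like (a sketch; the option names follow the `EXPORT DATABASE` and `COPY` docs linked above):
   
   ```sql
   -- Generate all 8 TPC-H tables at scale factor 1 as internal DuckDB tables.
   INSTALL tpch;
   LOAD tpch;
   CALL dbgen(sf = 1);
   
   -- Write all tables out as files, controlling format, compression, and row group size.
   EXPORT DATABASE 'tpch_sf1' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100000);
   
   -- Or write a single table, but the other 7 were still generated above.
   COPY lineitem TO 'lineitem.parquet' (FORMAT parquet, COMPRESSION zstd);
   ```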
   
   For datafusion-cli, it would be great to be able to generate individual tables (we can already do this with `tpchgen-cli -T`) and also to specify how the table should be physically stored on disk. Currently `tpchgen-cli` only exposes Parquet compression, e.g. `tpchgen-cli --format parquet --parquet-compression zstd`. A sketch of what that could look like is below.
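   For illustration only, here is the kind of workflow I have in mind. The `tpch.lineitem(1)` table function is hypothetical (it is what this issue is discussing), and any additional layout knobs (compression, row group size, etc.) would be whatever options we decide to expose on the write path:
   
   ```sql
   -- Hypothetical: generate only the lineitem table at sf = 1 via a table function,
   -- and write it out with an explicit physical layout, without generating the
   -- other 7 tables.
   COPY (SELECT * FROM tpch.lineitem(1))
     TO 'lineitem.parquet'
     STORED AS PARQUET;
   ```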
   