kevinjqliu commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2816875067
> In order to try and make progress on this, I decided to go with having a single function that builds all tables for a single scale factor similar to how DuckDB does it. My reasoning is that this approach is both simpler to implement and is more useful (benchmark queries need multiple tables).

+1, that's a good place to start!

I want to add some context to the single-function (`CALL dbgen(sf = 1);`) vs. table-function (`tpch.lineitem(1)`) discussion. DuckDB generates all 8 TPC-H tables with a single function, but they are saved internally as DuckDB tables. DuckDB's [`EXPORT DATABASE`](https://duckdb.org/docs/stable/sql/statements/export.html) statement can then write out all 8 tables as files. `EXPORT DATABASE` has several options for configuring the output files, such as file format, compression, and row group size. I can also save individual tables using [`COPY`](https://duckdb.org/docs/stable/sql/statements/copy), but I would have wasted CPU cycles generating the other tables.

For datafusion-cli, it would be great to be able to generate individual tables (we can already do this with `tpchgen-cli -T`) and also to specify how each table should be stored physically on disk (currently `tpchgen-cli` only supports Parquet compression, e.g. `tpchgen-cli --format parquet --parquet-compression zstd`).
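For reference, a minimal sketch of the DuckDB flow described above (the output directory, file name, and option values here are just illustrative):

```sql
-- Generate all 8 TPC-H tables at scale factor 1 as in-memory DuckDB tables
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 1);

-- Write every generated table out as Parquet, configuring compression and row group size
EXPORT DATABASE 'tpch_sf1' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100000);

-- Or export a single table, even though dbgen already spent the cycles generating all of them
COPY lineitem TO 'lineitem.parquet' (FORMAT parquet, COMPRESSION zstd);
```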