Re: [PR] Use `tpchgen-cli` to generate tpch data in bench.sh [datafusion]

via GitHub Tue, 02 Dec 2025 23:44:32 -0800


martin-g commented on code in PR #19035:
URL: https://github.com/apache/datafusion/pull/19035#discussion_r2583986975



##########
benchmarks/bench.sh:
##########
@@ -548,20 +544,19 @@ data_tpch() {
         echo "Internal error: Scale factor not specified"
         exit 1
     fi
+    FORMAT=$2

Review Comment:
   See https://github.com/apache/datafusion/pull/19035#discussion_r2579864175
   There are two calls to `data_tpch` there that do not pass the format.
   
   For those calls,
   https://github.com/apache/datafusion/pull/19035/files/907bce3e16352148eade3b7cf512091a9aab4232#diff-1769f5787dc11c8b1f1b48288cdf3c89d25a5b5cbc6be4740bfcc70a6313ba99R550
   will print `Creating tpch <EMPTY> dataset at Scale Factor`, where `<EMPTY>` is an empty string.
   
   And the third reason why I proposed `parquet` as default is:
   ```
   Also @comphead pointed out on
   https://github.com/apache/datafusion/pull/19034#pullrequestreview-3526952491
   that the bench.sh data tpch generated both csv and parquet files when it only really needs parquet.
   ```
   This sounds like parquet is the needed format most of the time.
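
A minimal sketch of the parquet default suggested above. It is not the actual `data_tpch` body: the function name `data_tpch_sketch` and the local variable names are hypothetical, and only the argument handling is shown. It uses `return` instead of `exit` so it can be tried in an interactive shell.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for data_tpch in benchmarks/bench.sh.
# Only the argument handling is sketched; no data is generated.
data_tpch_sketch() {
    local scale_factor=${1:-}
    if [ -z "${scale_factor}" ]; then
        echo "Internal error: Scale factor not specified"
        return 1
    fi
    # Fall back to parquet when the caller omits the second argument,
    # so the log line never contains an empty format.
    local format=${2:-parquet}
    echo "Creating tpch ${format} dataset at Scale Factor ${scale_factor}"
}

data_tpch_sketch 1        # format omitted, falls back to parquet
data_tpch_sketch 10 csv   # format passed explicitly
```

With a default like this, the two existing callers that pass only the scale factor would keep working without changes.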
   
   
   On the other hand, `data_h2o()` uses CSV as its default format:
   
https://github.com/alamb/datafusion/blob/907bce3e16352148eade3b7cf512091a9aab4232/benchmarks/bench.sh#L853



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


