alamb commented on code in PR #19035:
URL: https://github.com/apache/datafusion/pull/19035#discussion_r2582593478
##########
benchmarks/bench.sh:
##########
@@ -548,20 +544,19 @@ data_tpch() {
echo "Internal error: Scale factor not specified"
exit 1
fi
+ FORMAT=$2
Review Comment:
As I understand it, that would default the argument to `parquet`. What is the
rationale for doing so?
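
For reference, the two behaviours under discussion could be sketched roughly as below. This is only an illustration, not code from the PR: the helper names `format_or_default` and `format_required` are hypothetical, and `csv` is used as a placeholder second format.

```shell
# Hypothetical helpers for illustration only; neither exists in bench.sh.

# Variant A: fall back to parquet when no format is given.
# ${1:-parquet} expands to "parquet" if $1 is unset or empty.
format_or_default() {
    printf '%s\n' "${1:-parquet}"
}

# Variant B: require the caller to pass a format explicitly,
# mirroring the existing "Scale factor not specified" check.
format_required() {
    if [ -z "${1:-}" ]; then
        echo "Internal error: Format not specified" >&2
        return 1
    fi
    printf '%s\n' "$1"
}
```

With variant A a caller that forgets the argument silently gets parquet data; with variant B the same mistake fails loudly.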
##########
benchmarks/bench.sh:
##########
@@ -611,10 +611,10 @@ run_tpch() {
echo "Running tpch benchmark..."
FORMAT=$2
Review Comment:
Similarly to the comment above, I am not sure we want to default to parquet.
##########
benchmarks/bench.sh:
##########
@@ -574,27 +569,32 @@ data_tpch() {
docker run -v "${TPCH_DIR}":/data -it --entrypoint /bin/bash --rm ghcr.io/scalytics/tpch-docker:main -c "cp -f /opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/"
fi
- # Create 'parquet' files from tbl
- FILE="${TPCH_DIR}/supplier"
- if test -d "${FILE}"; then
- echo " parquet files exist ($FILE exists)."
- else
- echo " creating parquet files using benchmark binary ..."
- pushd "${SCRIPT_DIR}" > /dev/null
- $CARGO_COMMAND --bin tpch -- convert --input "${TPCH_DIR}" --output "${TPCH_DIR}" --format parquet
- popd > /dev/null
+ if [ "$FORMAT" = "parquet" ]; then
+ # Create 'parquet' files, one directory per file
+ FILE="${TPCH_DIR}/supplier"
+ if test -d "${FILE}"; then
+ echo " parquet files exist ($FILE exists)."
+ else
+ echo " creating parquet files using tpchgen-cli ..."
+ tpchgen-cli --scale-factor "${SCALE_FACTOR}" --format parquet --parquet-compression='ZSTD(1)' --parts=1 --output-dir "${TPCH_DIR}"
Review Comment:
This is a good question. I don't think the benchmark scripts are
parameterized to use a different compression; in other words, there is no
user-visible way to set one.
I used Zstd to mirror what the existing conversion code did.
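
If we ever wanted to make this configurable, one option might look like the sketch below. It is not part of this PR and rests on assumptions: `TPCH_PARQUET_COMPRESSION` and `TPCHGEN_BIN` are hypothetical environment variables, and `generate_tpch_parquet` is a hypothetical helper. The `tpchgen-cli` flags are the ones already used in the diff above.

```shell
# Hypothetical sketch; these variables and this function are not in bench.sh.
# TPCH_PARQUET_COMPRESSION: compression setting, defaulting to ZSTD(1)
#   to match the existing behaviour.
# TPCHGEN_BIN: generator binary, overridable so the command can be dry-run.
generate_tpch_parquet() {
    scale_factor="$1"
    output_dir="$2"
    compression="${TPCH_PARQUET_COMPRESSION:-ZSTD(1)}"

    "${TPCHGEN_BIN:-tpchgen-cli}" \
        --scale-factor "${scale_factor}" \
        --format parquet \
        --parquet-compression="${compression}" \
        --parts=1 \
        --output-dir "${output_dir}"
}
```

Running it as `TPCHGEN_BIN=echo generate_tpch_parquet 1 /tmp/tpch` prints the arguments instead of generating data, which is a cheap way to check what would be passed.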