Copilot commented on code in PR #21266:
URL: https://github.com/apache/datafusion/pull/21266#discussion_r3013156152
##########
benchmarks/bench.sh:
##########
@@ -1085,19 +1084,80 @@ run_external_aggr() {
}
# Runs the sort pushdown benchmark (without WITH ORDER)
-run_sort_pushdown() {
+# Generates sort pushdown benchmark data: multiple sorted parquet files with
+# reversed naming so alphabetical order does NOT match sort key order.
+# This forces the sort pushdown optimizer to reorder files by statistics.
+#
+# Files created (l_orderkey ranges):
+# c_high.parquet: 1-200000 (c sorts last alphabetically, but has
lowest keys)
+# b_mid.parquet: 200001-400000
+# a_low.parquet: 400001-600001 (a sorts first alphabetically, but has
highest keys)
+data_sort_pushdown() {
+ SORT_PUSHDOWN_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+ if [ -d "${SORT_PUSHDOWN_DIR}" ] && [ "$(ls -A
${SORT_PUSHDOWN_DIR}/*.parquet 2>/dev/null)" ]; then
+ echo "Sort pushdown data already exists at ${SORT_PUSHDOWN_DIR}"
+ return
+ fi
+
+ echo "Generating sort pushdown benchmark data..."
+
+ # First ensure we have TPC-H data to work with
+ data_tpch "1" "parquet"
+
TPCH_DIR="${DATA_DIR}/tpch_sf1"
+ mkdir -p "${SORT_PUSHDOWN_DIR}"
+
+ # Build datafusion-cli if needed
+ $CARGO_COMMAND --bin datafusion-cli
+
+ DATAFUSION_CLI="./target/release/datafusion-cli"
+ if [ ! -f "$DATAFUSION_CLI" ]; then
+ DATAFUSION_CLI="./target/debug/datafusion-cli"
Review Comment:
`$CARGO_COMMAND` defaults to `cargo run --release`, so `$CARGO_COMMAND --bin
datafusion-cli` will *execute* the CLI (potentially starting an interactive
session) rather than just building it. Also, the subsequent
`./target/{release,debug}/datafusion-cli` paths are relative to `benchmarks/`,
but Cargo outputs to `${DATAFUSION_DIR}/target/` by default, so this lookup
will fail. Suggest building explicitly (e.g. `cargo build --release --bin
datafusion-cli` in `${DATAFUSION_DIR}`) and setting `DATAFUSION_CLI` to
`${DATAFUSION_DIR}/target/...`, then invoke it via `"${DATAFUSION_CLI}"`.
```suggestion
(cd "$DATAFUSION_DIR" && cargo build --release --bin datafusion-cli)
DATAFUSION_CLI="${DATAFUSION_DIR}/target/release/datafusion-cli"
if [ ! -f "$DATAFUSION_CLI" ]; then
DATAFUSION_CLI="${DATAFUSION_DIR}/target/debug/datafusion-cli"
```
##########
benchmarks/bench.sh:
##########
@@ -1085,19 +1084,80 @@ run_external_aggr() {
}
# Runs the sort pushdown benchmark (without WITH ORDER)
-run_sort_pushdown() {
+# Generates sort pushdown benchmark data: multiple sorted parquet files with
+# reversed naming so alphabetical order does NOT match sort key order.
+# This forces the sort pushdown optimizer to reorder files by statistics.
+#
+# Files created (l_orderkey ranges):
+# c_high.parquet: 1-200000 (c sorts last alphabetically, but has
lowest keys)
+# b_mid.parquet: 200001-400000
+# a_low.parquet: 400001-600001 (a sorts first alphabetically, but has
highest keys)
+data_sort_pushdown() {
+ SORT_PUSHDOWN_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+ if [ -d "${SORT_PUSHDOWN_DIR}" ] && [ "$(ls -A
${SORT_PUSHDOWN_DIR}/*.parquet 2>/dev/null)" ]; then
Review Comment:
The existence check uses `ls` with an unquoted glob inside `$(...)` (`ls -A
${SORT_PUSHDOWN_DIR}/*.parquet`), which is brittle (breaks on spaces, and can
behave oddly when the glob doesn’t match). Prefer a glob check that doesn’t
invoke `ls`, e.g. `compgen -G "${SORT_PUSHDOWN_DIR}/*.parquet" > /dev/null` (or
`find ... -name '*.parquet' -print -quit`).
```suggestion
if [ -d "${SORT_PUSHDOWN_DIR}" ] && compgen -G
"${SORT_PUSHDOWN_DIR}"/*.parquet > /dev/null; then
```
##########
benchmarks/bench.sh:
##########
@@ -1085,19 +1084,80 @@ run_external_aggr() {
}
# Runs the sort pushdown benchmark (without WITH ORDER)
-run_sort_pushdown() {
+# Generates sort pushdown benchmark data: multiple sorted parquet files with
+# reversed naming so alphabetical order does NOT match sort key order.
+# This forces the sort pushdown optimizer to reorder files by statistics.
+#
+# Files created (l_orderkey ranges):
+# c_high.parquet: 1-200000 (c sorts last alphabetically, but has
lowest keys)
+# b_mid.parquet: 200001-400000
+# a_low.parquet: 400001-600001 (a sorts first alphabetically, but has
highest keys)
+data_sort_pushdown() {
Review Comment:
The documented l_orderkey ranges don’t match the actual predicates: the
comment says `a_low.parquet: 400001-600001`, but the query uses `WHERE
l_orderkey > 400000` (no upper bound), which will include the rest of the
dataset. Either update the comment to reflect the actual range or add an upper
bound so the generated files match the stated ranges.
##########
benchmarks/bench.sh:
##########
@@ -1085,19 +1084,80 @@ run_external_aggr() {
}
# Runs the sort pushdown benchmark (without WITH ORDER)
-run_sort_pushdown() {
+# Generates sort pushdown benchmark data: multiple sorted parquet files with
+# reversed naming so alphabetical order does NOT match sort key order.
+# This forces the sort pushdown optimizer to reorder files by statistics.
+#
+# Files created (l_orderkey ranges):
+# c_high.parquet: 1-200000 (c sorts last alphabetically, but has
lowest keys)
+# b_mid.parquet: 200001-400000
+# a_low.parquet: 400001-600001 (a sorts first alphabetically, but has
highest keys)
+data_sort_pushdown() {
+ SORT_PUSHDOWN_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+ if [ -d "${SORT_PUSHDOWN_DIR}" ] && [ "$(ls -A
${SORT_PUSHDOWN_DIR}/*.parquet 2>/dev/null)" ]; then
+ echo "Sort pushdown data already exists at ${SORT_PUSHDOWN_DIR}"
+ return
+ fi
+
+ echo "Generating sort pushdown benchmark data..."
+
+ # First ensure we have TPC-H data to work with
+ data_tpch "1" "parquet"
+
TPCH_DIR="${DATA_DIR}/tpch_sf1"
+ mkdir -p "${SORT_PUSHDOWN_DIR}"
+
+ # Build datafusion-cli if needed
+ $CARGO_COMMAND --bin datafusion-cli
+
+ DATAFUSION_CLI="./target/release/datafusion-cli"
+ if [ ! -f "$DATAFUSION_CLI" ]; then
+ DATAFUSION_CLI="./target/debug/datafusion-cli"
+ fi
+
+ echo "Creating sorted parquet files with reversed naming..."
+
+ $DATAFUSION_CLI --command "
+ CREATE EXTERNAL TABLE lineitem
+ STORED AS PARQUET
+ LOCATION '${TPCH_DIR}/lineitem/';
+
+ COPY (
+ SELECT * FROM lineitem
+ WHERE l_orderkey <= 200000
+ ORDER BY l_orderkey ASC
+ ) TO '${SORT_PUSHDOWN_DIR}/c_high.parquet';
Review Comment:
The generated SQL relies on `COPY ... TO '<path>.parquet'` producing exactly
one parquet file per range. If `datafusion.execution.target_partitions` is > 1,
COPY may emit multiple output files/parts, which can defeat the intended “3
files with reversed names” setup. Consider setting `SET
datafusion.execution.target_partitions = 1;` (and optionally other output
settings) at the start of the `--command` SQL, similar to how sqllogictests
force a single file group for ordering tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]