Re: [PR] feat: generate reversed-name data for sort pushdown benchmark [datafusion]

via GitHub Mon, 30 Mar 2026 19:47:39 -0700


Copilot commented on code in PR #21266:
URL: https://github.com/apache/datafusion/pull/21266#discussion_r3013156152



##########
benchmarks/bench.sh:
##########
@@ -1085,19 +1084,80 @@ run_external_aggr() {
 }
 
 # Runs the sort pushdown benchmark (without WITH ORDER)
-run_sort_pushdown() {
+# Generates sort pushdown benchmark data: multiple sorted parquet files with
+# reversed naming so alphabetical order does NOT match sort key order.
+# This forces the sort pushdown optimizer to reorder files by statistics.
+#
+# Files created (l_orderkey ranges):
+#   c_high.parquet: 1-200000       (c sorts last alphabetically, but has 
lowest keys)
+#   b_mid.parquet:  200001-400000
+#   a_low.parquet:  400001-600001  (a sorts first alphabetically, but has 
highest keys)
+data_sort_pushdown() {
+    SORT_PUSHDOWN_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+    if [ -d "${SORT_PUSHDOWN_DIR}" ] && [ "$(ls -A 
${SORT_PUSHDOWN_DIR}/*.parquet 2>/dev/null)" ]; then
+        echo "Sort pushdown data already exists at ${SORT_PUSHDOWN_DIR}"
+        return
+    fi
+
+    echo "Generating sort pushdown benchmark data..."
+
+    # First ensure we have TPC-H data to work with
+    data_tpch "1" "parquet"
+
     TPCH_DIR="${DATA_DIR}/tpch_sf1"
+    mkdir -p "${SORT_PUSHDOWN_DIR}"
+
+    # Build datafusion-cli if needed
+    $CARGO_COMMAND --bin datafusion-cli
+
+    DATAFUSION_CLI="./target/release/datafusion-cli"
+    if [ ! -f "$DATAFUSION_CLI" ]; then
+        DATAFUSION_CLI="./target/debug/datafusion-cli"

Review Comment:
   `$CARGO_COMMAND` defaults to `cargo run --release`, so `$CARGO_COMMAND --bin 
datafusion-cli` will *execute* the CLI (potentially starting an interactive 
session) rather than just building it. Also, the subsequent 
`./target/{release,debug}/datafusion-cli` paths are relative to `benchmarks/`, 
but Cargo outputs to `${DATAFUSION_DIR}/target/` by default, so this lookup 
will fail. Suggest building explicitly (e.g. `cargo build --release --bin 
datafusion-cli` in `${DATAFUSION_DIR}`) and setting `DATAFUSION_CLI` to 
`${DATAFUSION_DIR}/target/...`, then invoke it via `"${DATAFUSION_CLI}"`.
   ```suggestion
       (cd "$DATAFUSION_DIR" && cargo build --release --bin datafusion-cli)
   
       DATAFUSION_CLI="${DATAFUSION_DIR}/target/release/datafusion-cli"
       if [ ! -f "$DATAFUSION_CLI" ]; then
           DATAFUSION_CLI="${DATAFUSION_DIR}/target/debug/datafusion-cli"
   ```



##########
benchmarks/bench.sh:
##########
@@ -1085,19 +1084,80 @@ run_external_aggr() {
 }
 
 # Runs the sort pushdown benchmark (without WITH ORDER)
-run_sort_pushdown() {
+# Generates sort pushdown benchmark data: multiple sorted parquet files with
+# reversed naming so alphabetical order does NOT match sort key order.
+# This forces the sort pushdown optimizer to reorder files by statistics.
+#
+# Files created (l_orderkey ranges):
+#   c_high.parquet: 1-200000       (c sorts last alphabetically, but has 
lowest keys)
+#   b_mid.parquet:  200001-400000
+#   a_low.parquet:  400001-600001  (a sorts first alphabetically, but has 
highest keys)
+data_sort_pushdown() {
+    SORT_PUSHDOWN_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+    if [ -d "${SORT_PUSHDOWN_DIR}" ] && [ "$(ls -A 
${SORT_PUSHDOWN_DIR}/*.parquet 2>/dev/null)" ]; then

Review Comment:
   The existence check uses `ls` with an unquoted glob inside `$(...)` (`ls -A 
${SORT_PUSHDOWN_DIR}/*.parquet`), which is brittle (breaks on spaces, and can 
behave oddly when the glob doesn’t match). Prefer a glob check that doesn’t 
invoke `ls`, e.g. `compgen -G "${SORT_PUSHDOWN_DIR}/*.parquet" > /dev/null` (or 
`find ... -name '*.parquet' -print -quit`).
   ```suggestion
       if [ -d "${SORT_PUSHDOWN_DIR}" ] && compgen -G 
"${SORT_PUSHDOWN_DIR}"/*.parquet > /dev/null; then
   ```



##########
benchmarks/bench.sh:
##########
@@ -1085,19 +1084,80 @@ run_external_aggr() {
 }
 
 # Runs the sort pushdown benchmark (without WITH ORDER)
-run_sort_pushdown() {
+# Generates sort pushdown benchmark data: multiple sorted parquet files with
+# reversed naming so alphabetical order does NOT match sort key order.
+# This forces the sort pushdown optimizer to reorder files by statistics.
+#
+# Files created (l_orderkey ranges):
+#   c_high.parquet: 1-200000       (c sorts last alphabetically, but has 
lowest keys)
+#   b_mid.parquet:  200001-400000
+#   a_low.parquet:  400001-600001  (a sorts first alphabetically, but has 
highest keys)
+data_sort_pushdown() {

Review Comment:
   The documented l_orderkey ranges don’t match the actual predicates: the 
comment says `a_low.parquet: 400001-600001`, but the query uses `WHERE 
l_orderkey > 400000` (no upper bound), which will include the rest of the 
dataset. Either update the comment to reflect the actual range or add an upper 
bound so the generated files match the stated ranges.



##########
benchmarks/bench.sh:
##########
@@ -1085,19 +1084,80 @@ run_external_aggr() {
 }
 
 # Runs the sort pushdown benchmark (without WITH ORDER)
-run_sort_pushdown() {
+# Generates sort pushdown benchmark data: multiple sorted parquet files with
+# reversed naming so alphabetical order does NOT match sort key order.
+# This forces the sort pushdown optimizer to reorder files by statistics.
+#
+# Files created (l_orderkey ranges):
+#   c_high.parquet: 1-200000       (c sorts last alphabetically, but has 
lowest keys)
+#   b_mid.parquet:  200001-400000
+#   a_low.parquet:  400001-600001  (a sorts first alphabetically, but has 
highest keys)
+data_sort_pushdown() {
+    SORT_PUSHDOWN_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+    if [ -d "${SORT_PUSHDOWN_DIR}" ] && [ "$(ls -A 
${SORT_PUSHDOWN_DIR}/*.parquet 2>/dev/null)" ]; then
+        echo "Sort pushdown data already exists at ${SORT_PUSHDOWN_DIR}"
+        return
+    fi
+
+    echo "Generating sort pushdown benchmark data..."
+
+    # First ensure we have TPC-H data to work with
+    data_tpch "1" "parquet"
+
     TPCH_DIR="${DATA_DIR}/tpch_sf1"
+    mkdir -p "${SORT_PUSHDOWN_DIR}"
+
+    # Build datafusion-cli if needed
+    $CARGO_COMMAND --bin datafusion-cli
+
+    DATAFUSION_CLI="./target/release/datafusion-cli"
+    if [ ! -f "$DATAFUSION_CLI" ]; then
+        DATAFUSION_CLI="./target/debug/datafusion-cli"
+    fi
+
+    echo "Creating sorted parquet files with reversed naming..."
+
+    $DATAFUSION_CLI --command "
+        CREATE EXTERNAL TABLE lineitem
+        STORED AS PARQUET
+        LOCATION '${TPCH_DIR}/lineitem/';
+
+        COPY (
+            SELECT * FROM lineitem
+            WHERE l_orderkey <= 200000
+            ORDER BY l_orderkey ASC
+        ) TO '${SORT_PUSHDOWN_DIR}/c_high.parquet';

Review Comment:
   The generated SQL relies on `COPY ... TO '<path>.parquet'` producing exactly 
one parquet file per range. If `datafusion.execution.target_partitions` is > 1, 
COPY may emit multiple output files/parts, which can defeat the intended “3 
files with reversed names” setup. Consider setting `SET 
datafusion.execution.target_partitions = 1;` (and optionally other output 
settings) at the start of the `--command` SQL, similar to how sqllogictests 
force a single file group for ordering tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: generate reversed-name data for sort pushdown benchmark [datafusion]

Reply via email to