zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093849484
##########
benchmarks/bench.sh:
##########
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+ INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+ if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+ echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+ return
+ fi
+
+ echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+ # Re-use the sort_pushdown data as the source (generate if missing)
+ data_sort_pushdown
+
+ mkdir -p "${INEXACT_DIR}"
+ SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+ # Use datafusion-cli to bucket rows into 64 groups by a deterministic
+ # scrambler, then sort within each bucket by orderkey. This produces
+ # ~64 RG-sized segments where each has a tight orderkey range but the
+ # segments appear in scrambled (non-sorted) order in the file.
Review Comment:
Great suggestion! Partially overlapping RGs from streaming data is a very
realistic scenario. I will add a benchmark variant for this pattern when I
update the PR — something like time-ordered chunks with small overlaps between
adjacent chunks to simulate network delays / time skew.
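
As a rough sketch of that pattern, assuming fixed-width chunks over a dense integer key space and a symmetric overlap, the hypothetical helper `overlapping_chunk_ranges` below (not part of `bench.sh` or this PR) prints the key range each chunk would cover. Adjacent ranges share a small band of keys, much as late-arriving rows would spill into the next chunk:

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; not part of bench.sh.
# Prints the [min, max] key range of each chunk when neighboring chunks
# overlap by a fixed number of keys on each side.
# Args: total_keys chunk_count overlap
overlapping_chunk_ranges() {
    local total_keys=$1 chunk_count=$2 overlap=$3
    local width=$(( total_keys / chunk_count ))
    local i lo hi
    for (( i = 0; i < chunk_count; i++ )); do
        # Widen each chunk's nominal range by `overlap` keys on both sides.
        lo=$(( i * width - overlap ))
        hi=$(( (i + 1) * width - 1 + overlap ))
        # Clamp to the valid key space [0, total_keys - 1].
        if (( lo < 0 )); then lo=0; fi
        if (( hi > total_keys - 1 )); then hi=$(( total_keys - 1 )); fi
        echo "chunk=${i} min=${lo} max=${hi}"
    done
}
```

For example, `overlapping_chunk_ranges 1000 4 10` prints four ranges in which each chunk shares 20 keys with its neighbor. The real benchmark data would still need to be generated from the lineitem source, so this only illustrates the intended range layout.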
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]