Re: [PR] Add sorted data benchmark. [datafusion]

via GitHub Fri, 05 Dec 2025 01:50:42 -0800


zhuqi-lucas commented on code in PR #19042:
URL: https://github.com/apache/datafusion/pull/19042#discussion_r2592056916



##########
benchmarks/bench.sh:
##########
@@ -1197,10 +1206,105 @@ compare_benchmarks() {
 
 }
 
+# Creates sorted ClickBench data from hits.parquet (full dataset)
+# The data is sorted by EventTime in ascending order
+# Uses datafusion-cli to reduce dependencies
+data_sorted_clickbench() {
+    SORTED_FILE="${DATA_DIR}/hits_sorted.parquet"
+    ORIGINAL_FILE="${DATA_DIR}/hits.parquet"
+
+    # Default memory limit is 12GB, can be overridden with DATAFUSION_MEMORY_GB env var
+    MEMORY_LIMIT_GB=${DATAFUSION_MEMORY_GB:-12}
+
+    echo "Creating sorted ClickBench dataset from hits.parquet..."
+    echo "Configuration:"
+    echo "  Memory limit: ${MEMORY_LIMIT_GB}G"
+    echo "  Row group size: 64K rows"
+    echo "  Compression: uncompressed"
+
+    if [ ! -f "${ORIGINAL_FILE}" ]; then
+        echo "hits.parquet not found. Running data_clickbench_1 first..."
+        data_clickbench_1
+    fi
+
+    if [ -f "${SORTED_FILE}" ]; then
+        echo "Sorted hits.parquet already exists at ${SORTED_FILE}"
+        return 0
+    fi
+
+    echo "Sorting hits.parquet by EventTime (this may take several minutes)..."
+
+    pushd "${DATAFUSION_DIR}" > /dev/null
+    echo "Building datafusion-cli..."
+    cargo build --release --bin datafusion-cli
+    DATAFUSION_CLI="${DATAFUSION_DIR}/target/release/datafusion-cli"
+    popd > /dev/null
+
+    echo "Using datafusion-cli to create sorted parquet file..."
+    "${DATAFUSION_CLI}" << EOF
+-- Memory and performance configuration
+SET datafusion.runtime.memory_limit = '${MEMORY_LIMIT_GB}G';
+SET datafusion.execution.spill_compression = 'uncompressed';
+SET datafusion.execution.sort_spill_reservation_bytes = 10485760; -- 10MB
+SET datafusion.execution.batch_size = 8192;
+SET datafusion.execution.target_partitions = 1;

Review Comment:
   But it works for the huge data set:
   
   ```
      Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench clickbench --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/hits_sorted.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data --sorted-by EventTime -c datafusion.optimizer.prefer_existing_sort=true -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_18976/data_sorted_clickbench.json`
   Running benchmarks with the following options: RunOpt { query: None, pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_sorted.parquet", queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_18976/data_sorted_clickbench.json"), sorted_by: Some("EventTime"), sort_order: "ASC", config_options: ["datafusion.optimizer.prefer_existing_sort=true"] }
   ℹ️  Data is registered with sort order
   Setting config: datafusion.optimizer.prefer_existing_sort = true
   Registering table with sort order: EventTime ASC
   Executing: CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_sorted.parquet' WITH ORDER ("EventTime" ASC)
   Q0: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
   -- set datafusion.execution.parquet.binary_as_string = true
   SELECT * FROM hits ORDER BY "EventTime" DESC limit 10;
   
   Query 0 iteration 0 took 2388.0 ms and returned 10 rows
   Query 0 iteration 1 took 1789.9 ms and returned 10 rows
   Query 0 iteration 2 took 1844.1 ms and returned 10 rows
   Query 0 iteration 3 took 1816.4 ms and returned 10 rows
   Query 0 iteration 4 took 1808.9 ms and returned 10 rows
   Query 0 avg time: 1929.46 ms
   + set +x
   Done
   ```
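
For anyone cross-checking the reported average, here is a minimal sketch that recomputes it from the per-iteration timings in the log above. `avg_ms` is a hypothetical helper written for this illustration; it is not part of `bench.sh` or `dfbench`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bench.sh): print the arithmetic mean of
# its arguments, formatted with two decimals.
avg_ms() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR > 0) printf "%.2f\n", sum / NR }'
}

# Timings in ms copied from the Q0 log above.
avg_ms 2388.0 1789.9 1844.1 1816.4 1808.9   # all five iterations, prints 1929.46
avg_ms 1789.9 1844.1 1816.4 1808.9          # iterations 1-4 only, skipping the first run
```

The second call is only an assumption about what might be interesting here: iteration 0 is noticeably slower than the rest, so leaving it out gives a rough view of the steady-state time.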



