zhuqi-lucas commented on code in PR #19042:
URL: https://github.com/apache/datafusion/pull/19042#discussion_r2594654703
##########
benchmarks/bench.sh:
##########
@@ -1197,10 +1206,105 @@ compare_benchmarks() {
}
+# Creates sorted ClickBench data from hits.parquet (full dataset)
+# The data is sorted by EventTime in ascending order
+# Uses datafusion-cli to reduce dependencies
+data_sorted_clickbench() {
+ SORTED_FILE="${DATA_DIR}/hits_sorted.parquet"
+ ORIGINAL_FILE="${DATA_DIR}/hits.parquet"
+
+ # Default memory limit is 12GB, can be overridden with DATAFUSION_MEMORY_GB env var
+ MEMORY_LIMIT_GB=${DATAFUSION_MEMORY_GB:-12}
+
+ echo "Creating sorted ClickBench dataset from hits.parquet..."
+ echo "Configuration:"
+ echo " Memory limit: ${MEMORY_LIMIT_GB}G"
+ echo " Row group size: 64K rows"
+ echo " Compression: uncompressed"
+
+ if [ ! -f "${ORIGINAL_FILE}" ]; then
+ echo "hits.parquet not found. Running data_clickbench_1 first..."
+ data_clickbench_1
+ fi
+
+ if [ -f "${SORTED_FILE}" ]; then
+ echo "Sorted hits.parquet already exists at ${SORTED_FILE}"
+ return 0
+ fi
+
+ echo "Sorting hits.parquet by EventTime (this may take several minutes)..."
+
+ pushd "${DATAFUSION_DIR}" > /dev/null
+ echo "Building datafusion-cli..."
+ cargo build --release --bin datafusion-cli
+ DATAFUSION_CLI="${DATAFUSION_DIR}/target/release/datafusion-cli"
+ popd > /dev/null
+
+ echo "Using datafusion-cli to create sorted parquet file..."
+ "${DATAFUSION_CLI}" << EOF
+-- Memory and performance configuration
+SET datafusion.runtime.memory_limit = '${MEMORY_LIMIT_GB}G';
+SET datafusion.execution.spill_compression = 'uncompressed';
+SET datafusion.execution.sort_spill_reservation_bytes = 10485760; -- 10MB
+SET datafusion.execution.batch_size = 8192;
+SET datafusion.execution.target_partitions = 1;
Review Comment:
```text
Resources exhausted: Additional allocation failed for ExternalSorterMerge[1] with top memory consumers (across reservations) as:
ExternalSorter[2]#13(can spill: true) consumed 3.7 GB, peak 4.8 GB,
ExternalSorter[3]#15(can spill: true) consumed 3.5 GB, peak 4.4 GB,
ExternalSorterMerge[2]#14(can spill: false) consumed 2.3 GB, peak 2.3 GB,
ExternalSorterMerge[1]#12(can spill: false) consumed 1004.2 MB, peak 1694.0 MB,
ExternalSorterMerge[3]#16(can spill: false) consumed 845.7 MB, peak 1798.9 MB.
Error: Failed to allocate additional 12.7 MB for ExternalSorterMerge[1] with 998.9 MB already allocated for this reservation - 689.7 KB remain available for the total pool
\q
```
`ExternalSorterMerge` seems to be what triggers the `Resources exhausted` error when we have more than one partition.
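
To make the split visible, here is a rough sketch that totals the `consumed` figures from the consumer list in the error above, grouped by whether the reservation can spill. The helper name `sum_consumed_mb` is hypothetical, the input is only the five top consumers the error lists (rounded values), and 1 GB = 1024 MB is an assumption about how the sizes are printed.

```shell
# Consumer list copied from the error output above (peak figures omitted).
consumers='ExternalSorter[2]#13(can spill: true) consumed 3.7 GB
ExternalSorter[3]#15(can spill: true) consumed 3.5 GB
ExternalSorterMerge[2]#14(can spill: false) consumed 2.3 GB
ExternalSorterMerge[1]#12(can spill: false) consumed 1004.2 MB
ExternalSorterMerge[3]#16(can spill: false) consumed 845.7 MB'

# Hypothetical helper: sum "consumed" (in MB) for reservations whose
# "can spill" flag equals $1 ("true" or "false").
# Assumes 1 GB = 1024 MB.
sum_consumed_mb() {
  printf '%s\n' "$consumers" | awk -v want="$1" '
    index($0, "can spill: " want ")") {
      # Find the value and unit that follow the word "consumed".
      for (i = 1; i < NF; i++) if ($i == "consumed") { v = $(i + 1); u = $(i + 2) }
      total += (u == "GB") ? v * 1024 : v
    }
    END { printf "%.0f\n", total }'
}

echo "spillable (MB):     $(sum_consumed_mb true)"
echo "non-spillable (MB): $(sum_consumed_mb false)"
```

Under those assumptions, roughly 4.1 GB of the listed consumption sits in `ExternalSorterMerge` reservations marked `can spill: false`, so with several partitions a sizeable share of the pool cannot be freed by spilling.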