(datafusion) branch main updated: Add Tuning Guide for small data / short queries (#17040)

xudong963 Tue, 05 Aug 2025 04:29:12 -0700

This is an automated email from the ASF dual-hosted git repository.

xudong963 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git



The following commit(s) were added to refs/heads/main by this push:
     new f10deb67a3 Add Tuning Guide for small data / short queries (#17040)
f10deb67a3 is described below

commit f10deb67a3d34943027f6043f07b2af31f60c014
Author: Andrew Lamb <and...@nerdnetworks.org>
AuthorDate: Tue Aug 5 07:29:02 2025 -0400

    Add Tuning Guide for small data / short queries (#17040)
---
 dev/update_config_docs.sh         | 32 ++++++++++++++++++++++++++++++++
 docs/source/user-guide/configs.md | 28 ++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/dev/update_config_docs.sh b/dev/update_config_docs.sh
index 7baee8ee00..3052f9b803 100755
--- a/dev/update_config_docs.sh
+++ b/dev/update_config_docs.sh
@@ -119,6 +119,38 @@ EOF
 echo "Running CLI and inserting runtime config docs table"
 $PRINT_RUNTIME_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"
 
+cat <<'EOF' >> "$TARGET_FILE"
+
+# Tuning Guide
+
+## Short Queries
+
+By default, DataFusion attempts to maximize parallelism and use all cores. For
+example, if you have 32 cores, each plan will split the data into 32
+partitions. However, if your data is small, the overhead of splitting the data
+to enable parallelization can dominate the actual computation.
+
+To find out how many cores are being used, run the [`EXPLAIN`] command and look
+at the number of partitions in the plan.
+
+[`EXPLAIN`]: sql/explain.md
+
+The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
+[`ListingTable`] provider will attempt to repartition. However, this
+does not apply to user-defined data sources and only works when DataFusion has accurate statistics.
+
+If you know your data is small, you can set the `datafusion.execution.target_partitions`
+option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
+than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
+
+```sql
+SET datafusion.execution.target_partitions = '1';
+```
+
+[`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
+
+EOF
+
 
 echo "Running prettier"
 npx prettier@2.3.2 --write "$TARGET_FILE"
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
index c817daad2c..3fc8e98437 100644
--- a/docs/source/user-guide/configs.md
+++ b/docs/source/user-guide/configs.md
@@ -192,3 +192,31 @@ The following runtime configuration settings are available:
 | datafusion.runtime.max_temp_directory_size | 100G    | Maximum temporary file directory size. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes.    |
 | datafusion.runtime.memory_limit            | NULL    | Maximum memory limit for query execution. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes. |
 | datafusion.runtime.temp_directory          | NULL    | The path to the temporary file directory.                                                                                                   |
+
+# Tuning Guide
+
+## Short Queries
+
+By default, DataFusion attempts to maximize parallelism and use all cores. For
+example, if you have 32 cores, each plan will split the data into 32
+partitions. However, if your data is small, the overhead of splitting the data
+to enable parallelization can dominate the actual computation.
+
+To find out how many cores are being used, run the [`EXPLAIN`] command and look
+at the number of partitions in the plan.
+
+[`explain`]: sql/explain.md
+
+The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
+[`ListingTable`] provider will attempt to repartition. However, this
+does not apply to user-defined data sources and only works when DataFusion has accurate statistics.
+
+If you know your data is small, you can set the `datafusion.execution.target_partitions`
+option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
+than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
+
+```sql
+SET datafusion.execution.target_partitions = '1';
+```
+
+[`listingtable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
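
Note (not part of the commit): the guide added above suggests checking the partition count with `EXPLAIN` and then lowering `target_partitions`. A minimal sketch of that workflow follows. The table name `t` is hypothetical, and the exact plan text depends on the DataFusion version and data source, so treat this as an illustration rather than expected output.

```sql
-- Hypothetical table `t`; substitute one of your own tables.
-- Step 1: inspect the plan. Repartitioning operators in the physical
-- plan typically report the number of partitions they produce.
EXPLAIN SELECT count(*) FROM t;

-- Step 2: for a very small dataset, ask for a single partition,
-- as the guide recommends.
SET datafusion.execution.target_partitions = '1';

-- Step 3: re-run EXPLAIN and compare the two plans. With one target
-- partition there should be little or no repartitioning left.
EXPLAIN SELECT count(*) FROM t;
```

Comparing the two plans side by side is only a rough check; timing the query before and after the change is the more direct way to confirm the setting helps for a given workload.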

