This is an automated email from the ASF dual-hosted git repository. xudong963 pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/main by this push:
     new f10deb67a3 Add Tuning Guide for small data / short queries (#17040)
f10deb67a3 is described below

commit f10deb67a3d34943027f6043f07b2af31f60c014
Author: Andrew Lamb <and...@nerdnetworks.org>
AuthorDate: Tue Aug 5 07:29:02 2025 -0400

    Add Tuning Guide for small data / short queries (#17040)
---
 dev/update_config_docs.sh         | 32 ++++++++++++++++++++++++++++++++
 docs/source/user-guide/configs.md | 28 ++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/dev/update_config_docs.sh b/dev/update_config_docs.sh
index 7baee8ee00..3052f9b803 100755
--- a/dev/update_config_docs.sh
+++ b/dev/update_config_docs.sh
@@ -119,6 +119,38 @@ EOF
 echo "Running CLI and inserting runtime config docs table"
 $PRINT_RUNTIME_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"
+cat <<'EOF' >> "$TARGET_FILE"
+
+# Tuning Guide
+
+## Short Queries
+
+By default DataFusion will attempt to maximize parallelism and use all cores --
+For example, if you have 32 cores, each plan will split the data into 32
+partitions. However, if your data is small, the overhead of splitting the data
+to enable parallelization can dominate the actual computation.
+
+You can find out how many cores are being used via the [`EXPLAIN`] command and look
+at the number of partitions in the plan.
+
+[`EXPLAIN`]: sql/explain.md
+
+The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
+[`ListingTable`] provider will attempt to repartition. However, this
+does not apply to user defined data sources and only works when DataFusion has accurate statistics.
+
+If you know your data is small, you can set the `datafusion.execution.target_partitions`
+option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
+than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
+
+```sql
+SET datafusion.execution.target_partitions = '1';
+```
+
+[`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
+
+EOF
+
 echo "Running prettier"
 npx prettier@2.3.2 --write "$TARGET_FILE"
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
index c817daad2c..3fc8e98437 100644
--- a/docs/source/user-guide/configs.md
+++ b/docs/source/user-guide/configs.md
@@ -192,3 +192,31 @@ The following runtime configuration settings are available:
 | datafusion.runtime.max_temp_directory_size | 100G | Maximum temporary file directory size. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes. |
 | datafusion.runtime.memory_limit | NULL | Maximum memory limit for query execution. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes. |
 | datafusion.runtime.temp_directory | NULL | The path to the temporary file directory. |
+
+# Tuning Guide
+
+## Short Queries
+
+By default DataFusion will attempt to maximize parallelism and use all cores --
+For example, if you have 32 cores, each plan will split the data into 32
+partitions. However, if your data is small, the overhead of splitting the data
+to enable parallelization can dominate the actual computation.
+
+You can find out how many cores are being used via the [`EXPLAIN`] command and look
+at the number of partitions in the plan.
+
+[`explain`]: sql/explain.md
+
+The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
+[`ListingTable`] provider will attempt to repartition. However, this
+does not apply to user defined data sources and only works when DataFusion has accurate statistics.
+
+If you know your data is small, you can set the `datafusion.execution.target_partitions`
+option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
+than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
+
+```sql
+SET datafusion.execution.target_partitions = '1';
+```
+
+[`listingtable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
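
The tuning guide added by this commit suggests comparing the partition count reported by `EXPLAIN` before and after lowering `target_partitions`. A minimal sketch of that workflow follows, assuming a session (for example in `datafusion-cli`) with a small table already registered. The table name `small_table` is hypothetical and not part of the commit, and the exact plan text depends on the DataFusion version.

```sql
-- `small_table` is a hypothetical, already-registered table (an assumption,
-- not something defined in the commit above).

-- 1. Inspect the plan with the default settings and note how many
--    partitions the plan operators report.
EXPLAIN SELECT count(*) FROM small_table;

-- 2. Restrict planning to a single partition for this session, as the
--    guide recommends for very small datasets.
SET datafusion.execution.target_partitions = '1';

-- 3. Re-run EXPLAIN and compare the plan with the one from step 1.
EXPLAIN SELECT count(*) FROM small_table;
```

The setting applies to the current session only, so it can be tried on a short query without affecting other workloads.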