This is an automated email from the ASF dual-hosted git repository. github-bot pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/asf-site by this push: new 9cde957da6 Publish built docs triggered by f10deb67a3d34943027f6043f07b2af31f60c014 9cde957da6 is described below commit 9cde957da65b32a0fdedccc078e42eface050972 Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com> AuthorDate: Tue Aug 5 11:29:36 2025 +0000 Publish built docs triggered by f10deb67a3d34943027f6043f07b2af31f60c014 --- _sources/user-guide/configs.md.txt | 28 ++++++++++++++++++++++++++++ index.html | 1 + searchindex.js | 2 +- user-guide/configs.html | 33 +++++++++++++++++++++++++++++++++ 4 files changed, 63 insertions(+), 1 deletion(-) diff --git a/_sources/user-guide/configs.md.txt b/_sources/user-guide/configs.md.txt index c817daad2c..3fc8e98437 100644 --- a/_sources/user-guide/configs.md.txt +++ b/_sources/user-guide/configs.md.txt @@ -192,3 +192,31 @@ The following runtime configuration settings are available: | datafusion.runtime.max_temp_directory_size | 100G | Maximum temporary file directory size. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes. | | datafusion.runtime.memory_limit | NULL | Maximum memory limit for query execution. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes. | | datafusion.runtime.temp_directory | NULL | The path to the temporary file directory. | + +# Tuning Guide + +## Short Queries + +By default DataFusion will attempt to maximize parallelism and use all cores -- +For example, if you have 32 cores, each plan will split the data into 32 +partitions. However, if your data is small, the overhead of splitting the data +to enable parallelization can dominate the actual computation. + +You can find out how many cores are being used via the [`EXPLAIN`] command and look +at the number of partitions in the plan. 
+ +[`explain`]: sql/explain.md + +The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the +[`ListingTable`] provider will attempt to repartition. However, this +does not apply to user defined data sources and only works when DataFusion has accurate statistics. + +If you know your data is small, you can set the `datafusion.execution.target_partitions` +option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less +than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether. + +```sql +SET datafusion.execution.target_partitions = '1'; +``` + +[`listingtable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html diff --git a/index.html b/index.html index 1659ac2b80..f87f732139 100644 --- a/index.html +++ b/index.html @@ -645,6 +645,7 @@ See the <a class="reference external" href="https://datafusion.apache.org/contri <li class="toctree-l1"><a class="reference internal" href="user-guide/sql/index.html">SQL Reference</a></li> <li class="toctree-l1"><a class="reference internal" href="user-guide/configs.html">Configuration Settings</a></li> <li class="toctree-l1"><a class="reference internal" href="user-guide/configs.html#runtime-configuration-settings">Runtime Configuration Settings</a></li> +<li class="toctree-l1"><a class="reference internal" href="user-guide/configs.html#tuning-guide">Tuning Guide</a></li> <li class="toctree-l1"><a class="reference internal" href="user-guide/explain-usage.html">Reading Explain Plans</a></li> <li class="toctree-l1"><a class="reference internal" href="user-guide/faq.html">Frequently Asked Questions</a></li> <li class="toctree-l1"><a class="reference internal" href="user-guide/faq.html#how-does-datafusion-compare-with-xyz">How does DataFusion Compare with <code class="docutils literal notranslate"><span class="pre">XYZ</span></code>?</a></li> diff --git a/searchindex.js b/searchindex.js 
index 5a3649bdf8..d8193cdf49 100644 --- a/searchindex.js +++ b/searchindex.js @@ -1 +1 @@ -Search.setIndex({"alltitles":{"!=":[[56,"op-neq"]],"!~":[[56,"op-re-not-match"]],"!~*":[[56,"op-re-not-match-i"]],"!~~":[[56,"id19"]],"!~~*":[[56,"id20"]],"#":[[56,"op-bit-xor"]],"%":[[56,"op-modulo"]],"&":[[56,"op-bit-and"]],"(relation, name) tuples in logical fields and logical columns are unique":[[12,"relation-name-tuples-in-logical-fields-and-logical-columns-are-unique"]],"*":[[56,"op-multiply"]],"+":[[56,"op-plus"]],"-":[[56,"op-minus"]],"/":[[56,"op-divide"]],"<":[[56,"op-lt"]],"< [...] \ No newline at end of file +Search.setIndex({"alltitles":{"!=":[[56,"op-neq"]],"!~":[[56,"op-re-not-match"]],"!~*":[[56,"op-re-not-match-i"]],"!~~":[[56,"id19"]],"!~~*":[[56,"id20"]],"#":[[56,"op-bit-xor"]],"%":[[56,"op-modulo"]],"&":[[56,"op-bit-and"]],"(relation, name) tuples in logical fields and logical columns are unique":[[12,"relation-name-tuples-in-logical-fields-and-logical-columns-are-unique"]],"*":[[56,"op-multiply"]],"+":[[56,"op-plus"]],"-":[[56,"op-minus"]],"/":[[56,"op-divide"]],"<":[[56,"op-lt"]],"< [...] 
\ No newline at end of file diff --git a/user-guide/configs.html b/user-guide/configs.html index 76e2adae5a..34ba535724 100644 --- a/user-guide/configs.html +++ b/user-guide/configs.html @@ -578,6 +578,18 @@ Runtime Configuration Settings </a> </li> + <li class="toc-h1 nav-item toc-entry"> + <a class="reference internal nav-link" href="#tuning-guide"> + Tuning Guide + </a> + <ul class="visible nav section-nav flex-column"> + <li class="toc-h2 nav-item toc-entry"> + <a class="reference internal nav-link" href="#short-queries"> + Short Queries + </a> + </li> + </ul> + </li> </ul> </nav> @@ -1139,6 +1151,27 @@ example, to configure <code class="docutils literal notranslate"><span class="pr </tr> </tbody> </table> +</section> +<section id="tuning-guide"> +<h1>Tuning Guide<a class="headerlink" href="#tuning-guide" title="Link to this heading">¶</a></h1> +<section id="short-queries"> +<h2>Short Queries<a class="headerlink" href="#short-queries" title="Link to this heading">¶</a></h2> +<p>By default DataFusion will attempt to maximize parallelism and use all cores – +For example, if you have 32 cores, each plan will split the data into 32 +partitions. 
However, if your data is small, the overhead of splitting the data +to enable parallelization can dominate the actual computation.</p> +<p>You can find out how many cores are being used via the <a class="reference internal" href="sql/explain.html"><span class="std std-doc"><code class="docutils literal notranslate"><span class="pre">EXPLAIN</span></code></span></a> command and look +at the number of partitions in the plan.</p> +<p>The <code class="docutils literal notranslate"><span class="pre">datafusion.optimizer.repartition_file_min_size</span></code> option controls the minimum file size the +<a class="reference external" href="https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html"><code class="docutils literal notranslate"><span class="pre">ListingTable</span></code></a> provider will attempt to repartition. However, this +does not apply to user defined data sources and only works when DataFusion has accurate statistics.</p> +<p>If you know your data is small, you can set the <code class="docutils literal notranslate"><span class="pre">datafusion.execution.target_partitions</span></code> +option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. 
less +than 1MB), we recommend setting <code class="docutils literal notranslate"><span class="pre">target_partitions</span></code> to 1 to avoid repartitioning altogether.</p> +<div class="highlight-sql notranslate"><div class="highlight"><pre><span></span><span class="k">SET</span><span class="w"> </span><span class="n">datafusion</span><span class="p">.</span><span class="n">execution</span><span class="p">.</span><span class="n">target_partitions</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'1'</span><span class="p">;</span> +</pre></div> +</div> +</section> </section>