This is an automated email from the ASF dual-hosted git repository. github-bot pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/asf-site by this push: new cf8437a5a6 Publish built docs triggered by 2d7ae09262f7a1338c30192b33efbe1b2d1d9829 cf8437a5a6 is described below commit cf8437a5a6d88e03c74626daa5535c6f01d65d0a Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com> AuthorDate: Thu Jun 19 18:10:22 2025 +0000 Publish built docs triggered by 2d7ae09262f7a1338c30192b33efbe1b2d1d9829 --- _sources/library-user-guide/upgrading.md.txt | 18 ++++++++++++++-- _sources/user-guide/configs.md.txt | 2 +- _sources/user-guide/sql/ddl.md.txt | 8 +++---- library-user-guide/upgrading.html | 32 ++++++++++++++++++++++++++-- searchindex.js | 2 +- user-guide/configs.html | 4 ++-- user-guide/sql/ddl.html | 8 +++---- 7 files changed, 58 insertions(+), 16 deletions(-) diff --git a/_sources/library-user-guide/upgrading.md.txt b/_sources/library-user-guide/upgrading.md.txt index b502850b59..613c2be43d 100644 --- a/_sources/library-user-guide/upgrading.md.txt +++ b/_sources/library-user-guide/upgrading.md.txt @@ -21,6 +21,21 @@ ## DataFusion `49.0.0` +### `datafusion.execution.collect_statistics` now defaults to `true` + +The default value of the `datafusion.execution.collect_statistics` configuration +setting is now true. This change impacts users that use that value directly and relied +on its default value being `false`. + +This change also restores the default behavior of `ListingTable` to its previous. If you use it directly +you can maintain the current behavior by overriding the default value in your code. + +```rust +ListingOptions::new(Arc::new(ParquetFormat::default())) + .with_collect_stat(false) + // other options +``` + ### Metadata is now represented by `FieldMetadata` Metadata from the Arrow `Field` is now stored using the `FieldMetadata` @@ -127,7 +142,7 @@ match expr { [details on #16207]: https://github.com/apache/datafusion/pull/16207#issuecomment-2922659103 -### The `VARCHAR` SQL type is now represented as `Utf8View` in Arrow. 
+### The `VARCHAR` SQL type is now represented as `Utf8View` in Arrow The mapping of the SQL `VARCHAR` type has been changed from `Utf8` to `Utf8View` which improves performance for many string operations. You can read more about @@ -277,7 +292,6 @@ Additionally `ObjectStore::list` and `ObjectStore::list_with_offset` have been c [#6619]: https://github.com/apache/arrow-rs/pull/6619 [#7371]: https://github.com/apache/arrow-rs/pull/7371 -[#7328]: https://github.com/apache/arrow-rs/pull/6961 This requires converting from `usize` to `u64` occasionally as well as changes to `ObjectStore` implementations such as diff --git a/_sources/user-guide/configs.md.txt b/_sources/user-guide/configs.md.txt index 1b8233a541..b55e63293f 100644 --- a/_sources/user-guide/configs.md.txt +++ b/_sources/user-guide/configs.md.txt @@ -47,7 +47,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus | datafusion.catalog.newlines_in_values | false | Specifies whether newlines in (quoted) CSV values are supported. This is the default value for `format.newlines_in_values` for `CREATE EXTERNAL TABLE` if not specified explicitly in the statement. Parsing newlines in quoted values may be affected by execution behaviour such as parallel file scanning. Setting this to `true` ensures that newlines in values are parsed successfully, which [...] | datafusion.execution.batch_size | 8192 | Default batch size while creating new batches, it's especially useful for buffer-in-memory batches since creating tiny batches would result in too much metadata memory consumption [...] | datafusion.execution.coalesce_batches | true | When set to true, record batches will be examined between each operator and small batches will be coalesced into larger batches. This is helpful when there are highly selective filters or joins that could produce tiny output batches. The target batch size is determined by the configuration setting [...] 
-| datafusion.execution.collect_statistics | false | Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default `ListingTableProvider` in DataFusion. Defaults to false. [...] +| datafusion.execution.collect_statistics | true | Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default `ListingTableProvider` in DataFusion. Defaults to true. [...] | datafusion.execution.target_partitions | 0 | Number of partitions for query execution. Increasing partitions can increase concurrency. Defaults to the number of CPU cores on the system [...] | datafusion.execution.time_zone | +00:00 | The default time zone Some functions, e.g. `EXTRACT(HOUR from SOME_TIME)`, shift the underlying datetime according to this time zone, and then extract the hour [...] | datafusion.execution.parquet.enable_page_index | true | (reading) If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce the I/O and number of rows decoded. [...] diff --git a/_sources/user-guide/sql/ddl.md.txt b/_sources/user-guide/sql/ddl.md.txt index ff8fa9bac0..1d971594ad 100644 --- a/_sources/user-guide/sql/ddl.md.txt +++ b/_sources/user-guide/sql/ddl.md.txt @@ -95,14 +95,14 @@ LOCATION '/mnt/nyctaxi/tripdata.parquet'; :::{note} Statistics -: By default, when a table is created, DataFusion will _NOT_ read the files +: By default, when a table is created, DataFusion will read the files to gather statistics, which can be expensive but can accelerate subsequent -queries substantially. If you want to gather statistics +queries substantially. If you don't want to gather statistics when creating a table, set the `datafusion.execution.collect_statistics` -configuration option to `true` before creating the table. For example: +configuration option to `false` before creating the table. 
For example: ```sql -SET datafusion.execution.collect_statistics = true; +SET datafusion.execution.collect_statistics = false; ``` See the [config settings docs](../configs.md) for more details. diff --git a/library-user-guide/upgrading.html b/library-user-guide/upgrading.html index 734eec41b4..baa1e14504 100644 --- a/library-user-guide/upgrading.html +++ b/library-user-guide/upgrading.html @@ -559,6 +559,21 @@ </code> </a> <ul class="nav section-nav flex-column"> + <li class="toc-h3 nav-item toc-entry"> + <a class="reference internal nav-link" href="#datafusion-execution-collect-statistics-now-defaults-to-true"> + <code class="docutils literal notranslate"> + <span class="pre"> + datafusion.execution.collect_statistics + </span> + </code> + now defaults to + <code class="docutils literal notranslate"> + <span class="pre"> + true + </span> + </code> + </a> + </li> <li class="toc-h3 nav-item toc-entry"> <a class="reference internal nav-link" href="#metadata-is-now-represented-by-fieldmetadata"> Metadata is now represented by @@ -621,7 +636,7 @@ Utf8View </span> </code> - in Arrow. 
+ in Arrow </a> </li> <li class="toc-h3 nav-item toc-entry"> @@ -950,6 +965,19 @@ <h1>Upgrade Guides<a class="headerlink" href="#upgrade-guides" title="Link to this heading">¶</a></h1> <section id="datafusion-49-0-0"> <h2>DataFusion <code class="docutils literal notranslate"><span class="pre">49.0.0</span></code><a class="headerlink" href="#datafusion-49-0-0" title="Link to this heading">¶</a></h2> +<section id="datafusion-execution-collect-statistics-now-defaults-to-true"> +<h3><code class="docutils literal notranslate"><span class="pre">datafusion.execution.collect_statistics</span></code> now defaults to <code class="docutils literal notranslate"><span class="pre">true</span></code><a class="headerlink" href="#datafusion-execution-collect-statistics-now-defaults-to-true" title="Link to this heading">¶</a></h3> +<p>The default value of the <code class="docutils literal notranslate"><span class="pre">datafusion.execution.collect_statistics</span></code> configuration +setting is now true. This change impacts users that use that value directly and relied +on its default value being <code class="docutils literal notranslate"><span class="pre">false</span></code>.</p> +<p>This change also restores the default behavior of <code class="docutils literal notranslate"><span class="pre">ListingTable</span></code> to its previous. 
If you use it directly +you can maintain the current behavior by overriding the default value in your code.</p> +<div class="highlight-rust notranslate"><div class="highlight"><pre><span></span><span class="n">ListingOptions</span><span class="p">::</span><span class="n">new</span><span class="p">(</span><span class="n">Arc</span><span class="p">::</span><span class="n">new</span><span class="p">(</span><span class="n">ParquetFormat</span><span class="p">::</span><span class="n">default</span><span class="p">()))</span> +<span class="w"> </span><span class="p">.</span><span class="n">with_collect_stat</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span> +<span class="w"> </span><span class="c1">// other options</span> +</pre></div> +</div> +</section> <section id="metadata-is-now-represented-by-fieldmetadata"> <h3>Metadata is now represented by <code class="docutils literal notranslate"><span class="pre">FieldMetadata</span></code><a class="headerlink" href="#metadata-is-now-represented-by-fieldmetadata" title="Link to this heading">¶</a></h3> <p>Metadata from the Arrow <code class="docutils literal notranslate"><span class="pre">Field</span></code> is now stored using the <code class="docutils literal notranslate"><span class="pre">FieldMetadata</span></code> @@ -1037,7 +1065,7 @@ on <code class="docutils literal notranslate"><span class="pre">Expr::WindowFunc </div> </section> <section id="the-varchar-sql-type-is-now-represented-as-utf8view-in-arrow"> -<h3>The <code class="docutils literal notranslate"><span class="pre">VARCHAR</span></code> SQL type is now represented as <code class="docutils literal notranslate"><span class="pre">Utf8View</span></code> in Arrow.<a class="headerlink" href="#the-varchar-sql-type-is-now-represented-as-utf8view-in-arrow" title="Link to this heading">¶</a></h3> +<h3>The <code class="docutils literal notranslate"><span class="pre">VARCHAR</span></code> SQL type is now represented as <code 
class="docutils literal notranslate"><span class="pre">Utf8View</span></code> in Arrow<a class="headerlink" href="#the-varchar-sql-type-is-now-represented-as-utf8view-in-arrow" title="Link to this heading">¶</a></h3> <p>The mapping of the SQL <code class="docutils literal notranslate"><span class="pre">VARCHAR</span></code> type has been changed from <code class="docutils literal notranslate"><span class="pre">Utf8</span></code> to <code class="docutils literal notranslate"><span class="pre">Utf8View</span></code> which improves performance for many string operations. You can read more about <code class="docutils literal notranslate"><span class="pre">Utf8View</span></code> in the <a class="reference external" href="https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/">DataFusion blog post on German-style strings</a></p> diff --git a/searchindex.js b/searchindex.js index 089f216068..bf091943de 100644 --- a/searchindex.js +++ b/searchindex.js @@ -1 +1 @@ -Search.setIndex({"alltitles":{"!=":[[57,"op-neq"]],"!~":[[57,"op-re-not-match"]],"!~*":[[57,"op-re-not-match-i"]],"!~~":[[57,"id19"]],"!~~*":[[57,"id20"]],"#":[[57,"op-bit-xor"]],"%":[[57,"op-modulo"]],"&":[[57,"op-bit-and"]],"(relation, name) tuples in logical fields and logical columns are unique":[[12,"relation-name-tuples-in-logical-fields-and-logical-columns-are-unique"]],"*":[[57,"op-multiply"]],"+":[[57,"op-plus"]],"-":[[57,"op-minus"]],"/":[[57,"op-divide"]],"<":[[57,"op-lt"]],"< [...] 
\ No newline at end of file +Search.setIndex({"alltitles":{"!=":[[57,"op-neq"]],"!~":[[57,"op-re-not-match"]],"!~*":[[57,"op-re-not-match-i"]],"!~~":[[57,"id19"]],"!~~*":[[57,"id20"]],"#":[[57,"op-bit-xor"]],"%":[[57,"op-modulo"]],"&":[[57,"op-bit-and"]],"(relation, name) tuples in logical fields and logical columns are unique":[[12,"relation-name-tuples-in-logical-fields-and-logical-columns-are-unique"]],"*":[[57,"op-multiply"]],"+":[[57,"op-plus"]],"-":[[57,"op-minus"]],"/":[[57,"op-divide"]],"<":[[57,"op-lt"]],"< [...] \ No newline at end of file diff --git a/user-guide/configs.html b/user-guide/configs.html index df9a7adab7..5e14371193 100644 --- a/user-guide/configs.html +++ b/user-guide/configs.html @@ -654,8 +654,8 @@ Environment variables are read during <code class="docutils literal notranslate" <td><p>When set to true, record batches will be examined between each operator and small batches will be coalesced into larger batches. This is helpful when there are highly selective filters or joins that could produce tiny output batches. The target batch size is determined by the configuration setting</p></td> </tr> <tr class="row-even"><td><p>datafusion.execution.collect_statistics</p></td> -<td><p>false</p></td> -<td><p>Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default <code class="docutils literal notranslate"><span class="pre">ListingTableProvider</span></code> in DataFusion. Defaults to false.</p></td> +<td><p>true</p></td> +<td><p>Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default <code class="docutils literal notranslate"><span class="pre">ListingTableProvider</span></code> in DataFusion. 
Defaults to true.</p></td> </tr> <tr class="row-odd"><td><p>datafusion.execution.target_partitions</p></td> <td><p>0</p></td> diff --git a/user-guide/sql/ddl.html b/user-guide/sql/ddl.html index d22a883372..a87ad924d2 100644 --- a/user-guide/sql/ddl.html +++ b/user-guide/sql/ddl.html @@ -749,14 +749,14 @@ provide schema information for Parquet files.</p> <div class="admonition note"> <p class="admonition-title">Note</p> <dl class="simple myst"> -<dt>Statistics</dt><dd><p>By default, when a table is created, DataFusion will <em>NOT</em> read the files +<dt>Statistics</dt><dd><p>By default, when a table is created, DataFusion will read the files to gather statistics, which can be expensive but can accelerate subsequent -queries substantially. If you want to gather statistics +queries substantially. If you don’t want to gather statistics when creating a table, set the <code class="docutils literal notranslate"><span class="pre">datafusion.execution.collect_statistics</span></code> -configuration option to <code class="docutils literal notranslate"><span class="pre">true</span></code> before creating the table. For example:</p> +configuration option to <code class="docutils literal notranslate"><span class="pre">false</span></code> before creating the table. 
For example:</p> </dd> </dl> -<div class="highlight-sql notranslate"><div class="highlight"><pre><span></span><span class="k">SET</span><span class="w"> </span><span class="n">datafusion</span><span class="p">.</span><span class="n">execution</span><span class="p">.</span><span class="n">collect_statistics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">true</span><span class="p">;</span> +<div class="highlight-sql notranslate"><div class="highlight"><pre><span></span><span class="k">SET</span><span class="w"> </span><span class="n">datafusion</span><span class="p">.</span><span class="n">execution</span><span class="p">.</span><span class="n">collect_statistics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">false</span><span class="p">;</span> </pre></div> </div> <p>See the <a class="reference internal" href="../configs.html"><span class="std std-doc">config settings docs</span></a> for more details.</p>