This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/asf-site by this push: new 61105987b9 Publish built docs triggered by 26106a2aa8cd49802a0178bc2a9ac16b93524456 61105987b9 is described below commit 61105987b9698eac5437ad82d408f10494f29390 Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com> AuthorDate: Tue Sep 9 12:52:40 2025 +0000 Publish built docs triggered by 26106a2aa8cd49802a0178bc2a9ac16b93524456 --- _sources/library-user-guide/upgrading.md.txt | 11 +++ _sources/user-guide/cli/datasources.md.txt | 24 ++++++ _sources/user-guide/configs.md.txt | 1 + _sources/user-guide/sql/ddl.md.txt | 29 +++++++ library-user-guide/upgrading.html | 14 ++++ searchindex.js | 2 +- user-guide/cli/datasources.html | 20 +++++ user-guide/configs.html | 116 ++++++++++++++------------- user-guide/sql/ddl.html | 25 ++++++ 9 files changed, 185 insertions(+), 57 deletions(-) diff --git a/_sources/library-user-guide/upgrading.md.txt b/_sources/library-user-guide/upgrading.md.txt index b9fbc6e1be..6c8eb0ec1e 100644 --- a/_sources/library-user-guide/upgrading.md.txt +++ b/_sources/library-user-guide/upgrading.md.txt @@ -24,6 +24,17 @@ **Note:** DataFusion `50.0.0` has not been released yet. The information provided in this section pertains to features and changes that have already been merged to the main branch and are awaiting release in this version. You can see the current [status of the `50.0.0 `release here](https://github.com/apache/datafusion/issues/16799) +### ListingTable automatically detects Hive Partitioned tables + +DataFusion 50.0.0 automatically infers Hive partitions when using the `ListingTableFactory` and `CREATE EXTERNAL TABLE`. Previously, +when creating a `ListingTable`, datasets that use Hive partitioning (e.g. +`/table_root/column1=value1/column2=value2/data.parquet`) would not have the Hive columns reflected in +the table's schema or data. The previous behavior can be +restored by setting the `datafusion.execution.listing_table_factory_infer_partitions` configuration option to `false`. +See [issue #17049] for more details. + +[issue #17049]: https://github.com/apache/datafusion/issues/17049 + ### `MSRV` updated to 1.86.0 The Minimum Supported Rust Version (MSRV) has been updated to [`1.86.0`]. diff --git a/_sources/user-guide/cli/datasources.md.txt b/_sources/user-guide/cli/datasources.md.txt index c15b8a5e46..6b1a4887a8 100644 --- a/_sources/user-guide/cli/datasources.md.txt +++ b/_sources/user-guide/cli/datasources.md.txt @@ -162,6 +162,30 @@ STORED AS PARQUET LOCATION 'gs://bucket/my_table/'; ``` +When specifying a directory path that has a Hive compliant partition structure, by default, DataFusion CLI will +automatically parse and incorporate the Hive columns and their values into the table's schema and data. 
Given the +following remote object paths: + +```console +gs://bucket/my_table/a=1/b=100/file1.parquet +gs://bucket/my_table/a=2/b=200/file2.parquet +``` + +`my_table` can be queried and filtered on the Hive columns: + +```sql +CREATE EXTERNAL TABLE my_table +STORED AS PARQUET +LOCATION 'gs://bucket/my_table/'; + +SELECT count(*) FROM my_table WHERE b=200; ++----------+ +| count(*) | ++----------+ +| 1 | ++----------+ +``` + # Formats ## Parquet diff --git a/_sources/user-guide/configs.md.txt b/_sources/user-guide/configs.md.txt index 3d4730958f..f00c4c2acc 100644 --- a/_sources/user-guide/configs.md.txt +++ b/_sources/user-guide/configs.md.txt @@ -118,6 +118,7 @@ The following configuration settings are available: | datafusion.execution.soft_max_rows_per_output_file | 50000000 | Target number of rows in output files when writing multiple. This is a soft max, so it can be exceeded slightly. There also will be one file smaller than the limit if the total number of rows written is not roughly divisible by the soft max [...] | datafusion.execution.max_buffered_batches_per_output_file | 2 | This is the maximum number of RecordBatches buffered for each output file being worked. Higher values can potentially give faster write performance at the cost of higher peak memory consumption [...] | datafusion.execution.listing_table_ignore_subdirectory | true | Should sub directories be ignored when scanning directories for data files. Defaults to true (ignores subdirectories), consistent with Hive. Note that this setting does not affect reading partitioned tables (e.g. `/table/year=2021/month=01/data.parquet`). [...] +| datafusion.execution.listing_table_factory_infer_partitions | true | Should a `ListingTable` created through the `ListingTableFactory` infer table partitions from Hive compliant directories. Defaults to true (partition columns are inferred and will be represented in the table schema). [...] | datafusion.execution.enable_recursive_ctes | true | Should DataFusion support recursive CTEs [...] | datafusion.execution.split_file_groups_by_statistics | false | Attempt to eliminate sorts by packing & sorting files with non-overlapping statistics into the same file groups. Currently experimental [...] | datafusion.execution.keep_partition_by_columns | false | Should DataFusion keep the columns used for partition_by in the output RecordBatches [...] diff --git a/_sources/user-guide/sql/ddl.md.txt b/_sources/user-guide/sql/ddl.md.txt index b5028cca49..bd41f691bf 100644 --- a/_sources/user-guide/sql/ddl.md.txt +++ b/_sources/user-guide/sql/ddl.md.txt @@ -169,6 +169,35 @@ LOCATION '/path/to/directory/of/files' OPTIONS ('has_header' 'true'); ``` +Tables that are partitioned using a Hive compliant partitioning scheme will have their columns and values automatically +detected and incorporated into the table's schema and data. Given the following example directory structure: + +```console +hive_partitioned/ +├── a=1 +│ └── b=200 +│ └── file1.parquet +└── a=2 + └── b=100 + └── file2.parquet +``` + +Users can specify the top level `hive_partitioned` directory as an `EXTERNAL TABLE` and leverage the Hive partitions to query +and filter data. 
+ +```sql +CREATE EXTERNAL TABLE hive_partitioned +STORED AS PARQUET +LOCATION '/path/to/hive_partitioned/'; + +SELECT count(*) FROM hive_partitioned WHERE b=100; ++------------------+ +| count(*) | ++------------------+ +| 1 | ++------------------+ +``` + ### Example: Unbounded Data Sources We can create unbounded data sources using the `CREATE UNBOUNDED EXTERNAL TABLE` SQL statement. diff --git a/library-user-guide/upgrading.html b/library-user-guide/upgrading.html index abd7dfd257..642e219edb 100644 --- a/library-user-guide/upgrading.html +++ b/library-user-guide/upgrading.html @@ -559,6 +559,11 @@ </code> </a> <ul class="nav section-nav flex-column"> + <li class="toc-h3 nav-item toc-entry"> + <a class="reference internal nav-link" href="#listingtable-automatically-detects-hive-partitioned-tables"> + ListingTable automatically detects Hive Partitioned tables + </a> + </li> <li class="toc-h3 nav-item toc-entry"> <a class="reference internal nav-link" href="#msrv-updated-to-1-86-0"> <code class="docutils literal notranslate"> @@ -1299,6 +1304,15 @@ <h2>DataFusion <code class="docutils literal notranslate"><span class="pre">50.0.0</span></code><a class="headerlink" href="#datafusion-50-0-0" title="Link to this heading">¶</a></h2> <p><strong>Note:</strong> DataFusion <code class="docutils literal notranslate"><span class="pre">50.0.0</span></code> has not been released yet. The information provided in this section pertains to features and changes that have already been merged to the main branch and are awaiting release in this version. You can see the current <a class="reference external" href="https://github.com/apache/datafusion/issues/16799">status of the <code class="docutils literal notranslate"><span class="pre">50.0.0</span> </code>release here</a></p> +<section id="listingtable-automatically-detects-hive-partitioned-tables"> +<h3>ListingTable automatically detects Hive Partitioned tables<a class="headerlink" href="#listingtable-automatically-detects-hive-partitioned-tables" title="Link to this heading">¶</a></h3> +<p>DataFusion 50.0.0 automatically infers Hive partitions when using the <code class="docutils literal notranslate"><span class="pre">ListingTableFactory</span></code> and <code class="docutils literal notranslate"><span class="pre">CREATE</span> <span class="pre">EXTERNAL</span> <span class="pre">TABLE</span></code>. Previously, +when creating a <code class="docutils literal notranslate"><span class="pre">ListingTable</span></code>, datasets that use Hive partitioning (e.g. +<code class="docutils literal notranslate"><span class="pre">/table_root/column1=value1/column2=value2/data.parquet</span></code>) would not have the Hive columns reflected in +the table’s schema or data. The previous behavior can be +restored by setting the <code class="docutils literal notranslate"><span class="pre">datafusion.execution.listing_table_factory_infer_partitions</span></code> configuration option to <code class="docutils literal notranslate"><span class="pre">false</span></code>. 
+See <a class="reference external" href="https://github.com/apache/datafusion/issues/17049">issue #17049</a> for more details.</p> +</section> <section id="msrv-updated-to-1-86-0"> <h3><code class="docutils literal notranslate"><span class="pre">MSRV</span></code> updated to 1.86.0<a class="headerlink" href="#msrv-updated-to-1-86-0" title="Link to this heading">¶</a></h3> <p>The Minimum Supported Rust Version (MSRV) has been updated to <a class="reference external" href="https://releases.rs/docs/1.86.0/"><code class="docutils literal notranslate"><span class="pre">1.86.0</span></code></a>. diff --git a/searchindex.js b/searchindex.js index 4601dbfba5..e0ac9f0117 100644 --- a/searchindex.js +++ b/searchindex.js @@ -1 +1 @@ -Search.setIndex({"alltitles":{"!=":[[57,"op-neq"]],"!~":[[57,"op-re-not-match"]],"!~*":[[57,"op-re-not-match-i"]],"!~~":[[57,"id19"]],"!~~*":[[57,"id20"]],"#":[[57,"op-bit-xor"]],"%":[[57,"op-modulo"]],"&":[[57,"op-bit-and"]],"(relation, name) tuples in logical fields and logical columns are unique":[[12,"relation-name-tuples-in-logical-fields-and-logical-columns-are-unique"]],"*":[[57,"op-multiply"]],"+":[[57,"op-plus"]],"-":[[57,"op-minus"]],"/":[[57,"op-divide"]],"<":[[57,"op-lt"]],"< [...] \ No newline at end of file +Search.setIndex({"alltitles":{"!=":[[57,"op-neq"]],"!~":[[57,"op-re-not-match"]],"!~*":[[57,"op-re-not-match-i"]],"!~~":[[57,"id19"]],"!~~*":[[57,"id20"]],"#":[[57,"op-bit-xor"]],"%":[[57,"op-modulo"]],"&":[[57,"op-bit-and"]],"(relation, name) tuples in logical fields and logical columns are unique":[[12,"relation-name-tuples-in-logical-fields-and-logical-columns-are-unique"]],"*":[[57,"op-multiply"]],"+":[[57,"op-plus"]],"-":[[57,"op-minus"]],"/":[[57,"op-divide"]],"<":[[57,"op-lt"]],"< [...] \ No newline at end of file diff --git a/user-guide/cli/datasources.html b/user-guide/cli/datasources.html index 2587da596d..c74abfd37f 100644 --- a/user-guide/cli/datasources.html +++ b/user-guide/cli/datasources.html @@ -797,6 +797,26 @@ path, and it will automatically read all compatible files within that directory. <span class="k">LOCATION</span><span class="w"> </span><span class="s1">'gs://bucket/my_table/'</span><span class="p">;</span> </pre></div> </div> +<p>When specifying a directory path that has a Hive compliant partition structure, by default, DataFusion CLI will +automatically parse and incorporate the Hive columns and their values into the table’s schema and data. 
Given the +following remote object paths:</p> +<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">gs://bucket/my_table/a=1/b=100/file1.parquet</span> +<span class="go">gs://bucket/my_table/a=2/b=200/file2.parquet</span> +</pre></div> +</div> +<p><code class="docutils literal notranslate"><span class="pre">my_table</span></code> can be queried and filtered on the Hive columns:</p> +<div class="highlight-sql notranslate"><div class="highlight"><pre><span></span><span class="k">CREATE</span><span class="w"> </span><span class="k">EXTERNAL</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">my_table</span> +<span class="n">STORED</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">PARQUET</span> +<span class="k">LOCATION</span><span class="w"> </span><span class="s1">'gs://bucket/my_table/'</span><span class="p">;</span> + +<span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">my_table</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">b</span><span class="o">=</span><span class="mi">200</span><span class="p">;</span> +<span class="o">+</span><span class="c1">----------+</span> +<span class="o">|</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="o">|</span> +<span class="o">+</span><span class="c1">----------+</span> +<span class="o">|</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o">|</span> +<span class="o">+</span><span class="c1">----------+</span> +</pre></div> +</div> </section> <section id="formats"> <h1>Formats<a class="headerlink" href="#formats" title="Link to this heading">¶</a></h1> diff --git a/user-guide/configs.html b/user-guide/configs.html index d9b6cd4d05..97dd0b04c5 100644 --- a/user-guide/configs.html +++ b/user-guide/configs.html @@ -902,227 +902,231 @@ example, to configure <code class="docutils literal notranslate"><span class="pr <td><p>true</p></td> <td><p>Should sub directories be ignored when scanning directories for data files. Defaults to true (ignores subdirectories), consistent with Hive. Note that this setting does not affect reading partitioned tables (e.g. <code class="docutils literal notranslate"><span class="pre">/table/year=2021/month=01/data.parquet</span></code>).</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.execution.enable_recursive_ctes</p></td> +<tr class="row-odd"><td><p>datafusion.execution.listing_table_factory_infer_partitions</p></td> +<td><p>true</p></td> +<td><p>Should a <code class="docutils literal notranslate"><span class="pre">ListingTable</span></code> created through the <code class="docutils literal notranslate"><span class="pre">ListingTableFactory</span></code> infer table partitions from Hive compliant directories. 
Defaults to true (partition columns are inferred and will be represented in the table schema).</p></td> +</tr> +<tr class="row-even"><td><p>datafusion.execution.enable_recursive_ctes</p></td> <td><p>true</p></td> <td><p>Should DataFusion support recursive CTEs</p></td> </tr> -<tr class="row-even"><td><p>datafusion.execution.split_file_groups_by_statistics</p></td> +<tr class="row-odd"><td><p>datafusion.execution.split_file_groups_by_statistics</p></td> <td><p>false</p></td> <td><p>Attempt to eliminate sorts by packing & sorting files with non-overlapping statistics into the same file groups. Currently experimental</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.execution.keep_partition_by_columns</p></td> +<tr class="row-even"><td><p>datafusion.execution.keep_partition_by_columns</p></td> <td><p>false</p></td> <td><p>Should DataFusion keep the columns used for partition_by in the output RecordBatches</p></td> </tr> -<tr class="row-even"><td><p>datafusion.execution.skip_partial_aggregation_probe_ratio_threshold</p></td> +<tr class="row-odd"><td><p>datafusion.execution.skip_partial_aggregation_probe_ratio_threshold</p></td> <td><p>0.8</p></td> <td><p>Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.execution.skip_partial_aggregation_probe_rows_threshold</p></td> +<tr class="row-even"><td><p>datafusion.execution.skip_partial_aggregation_probe_rows_threshold</p></td> <td><p>100000</p></td> <td><p>Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode</p></td> </tr> -<tr class="row-even"><td><p>datafusion.execution.use_row_number_estimates_to_optimize_partitioning</p></td> +<tr class="row-odd"><td><p>datafusion.execution.use_row_number_estimates_to_optimize_partitioning</p></td> <td><p>false</p></td> <td><p>Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to <code class="docutils literal notranslate"><span class="pre">true</span></code> will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future.</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.execution.enforce_batch_size_in_joins</p></td> +<tr class="row-even"><td><p>datafusion.execution.enforce_batch_size_in_joins</p></td> <td><p>false</p></td> <td><p>Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower.</p></td> </tr> -<tr class="row-even"><td><p>datafusion.execution.objectstore_writer_buffer_size</p></td> +<tr class="row-odd"><td><p>datafusion.execution.objectstore_writer_buffer_size</p></td> <td><p>10485760</p></td> <td><p>Size (bytes) of data buffer DataFusion uses when writing output files. This affects the size of the data chunks that are uploaded to remote object stores (e.g. AWS S3). 
If very large (>= 100 GiB) output files are being written, it may be necessary to increase this size to avoid errors from the remote end point.</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.enable_distinct_aggregation_soft_limit</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.enable_distinct_aggregation_soft_limit</p></td> <td><p>true</p></td> <td><p>When set to true, the optimizer will push a limit operation into grouped aggregations which have no aggregate expressions, as a soft limit, emitting groups once the limit is reached, before all rows in the group are read.</p></td> </tr> -<tr class="row-even"><td><p>datafusion.optimizer.enable_round_robin_repartition</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.enable_round_robin_repartition</p></td> <td><p>true</p></td> <td><p>When set to true, the physical plan optimizer will try to add round robin repartitioning to increase parallelism to leverage more CPU cores</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.enable_topk_aggregation</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.enable_topk_aggregation</p></td> <td><p>true</p></td> <td><p>When set to true, the optimizer will attempt to perform limit operations during aggregations, if possible</p></td> </tr> -<tr class="row-even"><td><p>datafusion.optimizer.enable_window_limits</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.enable_window_limits</p></td> <td><p>true</p></td> <td><p>When set to true, the optimizer will attempt to push limit operations past window functions, if possible</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.enable_dynamic_filter_pushdown</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.enable_dynamic_filter_pushdown</p></td> <td><p>true</p></td> <td><p>When set to true attempts to push down dynamic filters generated by operators into the file scan phase. For example, for a query such as <code class="docutils literal notranslate"><span class="pre">SELECT</span> <span class="pre">*</span> <span class="pre">FROM</span> <span class="pre">t</span> <span class="pre">ORDER</span> <span class="pre">BY</span> <span class="pre">timestamp</span> <span class="pre">DESC</span> <span class="pre">LIMIT</span> <span class="pre">10</span></code> [...] </tr> -<tr class="row-even"><td><p>datafusion.optimizer.filter_null_join_keys</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.filter_null_join_keys</p></td> <td><p>false</p></td> <td><p>When set to true, the optimizer will insert filters before a join between a nullable and non-nullable column to filter out nulls on the nullable side. 
This filter can add additional overhead when the file format does not fully support predicate push down.</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.repartition_aggregations</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.repartition_aggregations</p></td> <td><p>true</p></td> <td><p>Should DataFusion repartition data using the aggregate keys to execute aggregates in parallel using the provided <code class="docutils literal notranslate"><span class="pre">target_partitions</span></code> level</p></td> </tr> -<tr class="row-even"><td><p>datafusion.optimizer.repartition_file_min_size</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.repartition_file_min_size</p></td> <td><p>10485760</p></td> <td><p>Minimum total files size in bytes to perform file scan repartitioning.</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.repartition_joins</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.repartition_joins</p></td> <td><p>true</p></td> <td><p>Should DataFusion repartition data using the join keys to execute joins in parallel using the provided <code class="docutils literal notranslate"><span class="pre">target_partitions</span></code> level</p></td> </tr> -<tr class="row-even"><td><p>datafusion.optimizer.allow_symmetric_joins_without_pruning</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.allow_symmetric_joins_without_pruning</p></td> <td><p>true</p></td> <td><p>Should DataFusion allow symmetric hash joins for unbounded data sources even when its inputs do not have any ordering or filtering If the flag is not enabled, the SymmetricHashJoin operator will be unable to prune its internal buffers, resulting in certain join types - such as Full, Left, LeftAnti, LeftSemi, Right, RightAnti, and RightSemi - being produced only at the end of the execution. This is not typical in stream processing. Additionally, without proper design for long runne [...] </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.repartition_file_scans</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.repartition_file_scans</p></td> <td><p>true</p></td> <td><p>When set to <code class="docutils literal notranslate"><span class="pre">true</span></code>, datasource partitions will be repartitioned to achieve maximum parallelism. This applies to both in-memory partitions and FileSource’s file groups (1 group is 1 partition). For FileSources, only Parquet and CSV formats are currently supported. If set to <code class="docutils literal notranslate"><span class="pre">true</span></code> for a FileSource, all files will be repartitioned evenly ( [...] </tr> -<tr class="row-even"><td><p>datafusion.optimizer.repartition_windows</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.repartition_windows</p></td> <td><p>true</p></td> <td><p>Should DataFusion repartition data using the partitions keys to execute window functions in parallel using the provided <code class="docutils literal notranslate"><span class="pre">target_partitions</span></code> level</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.repartition_sorts</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.repartition_sorts</p></td> <td><p>true</p></td> <td><p>Should DataFusion execute sorts in a per-partition fashion and merge afterwards instead of coalescing first and sorting globally. 
With this flag is enabled, plans in the form below <code class="docutils literal notranslate"><span class="pre">text</span> <span class="pre">"SortExec:</span> <span class="pre">[a@0</span> <span class="pre">ASC]",</span> <span class="pre">"</span> <span class="pre">CoalescePartitionsExec",</span> <span class="pre">"</span> [...] </tr> -<tr class="row-even"><td><p>datafusion.optimizer.prefer_existing_sort</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.prefer_existing_sort</p></td> <td><p>false</p></td> <td><p>When true, DataFusion will opportunistically remove sorts when the data is already sorted, (i.e. setting <code class="docutils literal notranslate"><span class="pre">preserve_order</span></code> to true on <code class="docutils literal notranslate"><span class="pre">RepartitionExec</span></code> and using <code class="docutils literal notranslate"><span class="pre">SortPreservingMergeExec</span></code>) When false, DataFusion will maximize plan parallelism using <code class="docut [...] </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.skip_failed_rules</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.skip_failed_rules</p></td> <td><p>false</p></td> <td><p>When set to true, the logical plan optimizer will produce warning messages if any optimization rules produce errors and then proceed to the next rule. When set to false, any rules that produce errors will cause the query to fail</p></td> </tr> -<tr class="row-even"><td><p>datafusion.optimizer.max_passes</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.max_passes</p></td> <td><p>3</p></td> <td><p>Number of times that the optimizer will attempt to optimize the plan</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.top_down_join_key_reordering</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.top_down_join_key_reordering</p></td> <td><p>true</p></td> <td><p>When set to true, the physical plan optimizer will run a top down process to reorder the join keys</p></td> </tr> -<tr class="row-even"><td><p>datafusion.optimizer.prefer_hash_join</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.prefer_hash_join</p></td> <td><p>true</p></td> <td><p>When set to true, the physical plan optimizer will prefer HashJoin over SortMergeJoin. HashJoin can work more efficiently than SortMergeJoin but consumes more memory</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.hash_join_single_partition_threshold</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.hash_join_single_partition_threshold</p></td> <td><p>1048576</p></td> <td><p>The maximum estimated size in bytes for one input side of a HashJoin will be collected into a single partition</p></td> </tr> -<tr class="row-even"><td><p>datafusion.optimizer.hash_join_single_partition_threshold_rows</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.hash_join_single_partition_threshold_rows</p></td> <td><p>131072</p></td> <td><p>The maximum estimated size in rows for one input side of a HashJoin will be collected into a single partition</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.default_filter_selectivity</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.default_filter_selectivity</p></td> <td><p>20</p></td> <td><p>The default filter selectivity used by Filter Statistics when an exact selectivity cannot be determined. 
Valid values are between 0 (no selectivity) and 100 (all rows are selected).</p></td> </tr> -<tr class="row-even"><td><p>datafusion.optimizer.prefer_existing_union</p></td> +<tr class="row-odd"><td><p>datafusion.optimizer.prefer_existing_union</p></td> <td><p>false</p></td> <td><p>When set to true, the optimizer will not attempt to convert Union to Interleave</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.optimizer.expand_views_at_output</p></td> +<tr class="row-even"><td><p>datafusion.optimizer.expand_views_at_output</p></td> <td><p>false</p></td> <td><p>When set to true, if the returned type is a view type then the output will be coerced to a non-view. Coerces <code class="docutils literal notranslate"><span class="pre">Utf8View</span></code> to <code class="docutils literal notranslate"><span class="pre">LargeUtf8</span></code>, and <code class="docutils literal notranslate"><span class="pre">BinaryView</span></code> to <code class="docutils literal notranslate"><span class="pre">LargeBinary</span></code>.</p></td> </tr> -<tr class="row-even"><td><p>datafusion.explain.logical_plan_only</p></td> +<tr class="row-odd"><td><p>datafusion.explain.logical_plan_only</p></td> <td><p>false</p></td> <td><p>When set to true, the explain statement will only print logical plans</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.explain.physical_plan_only</p></td> +<tr class="row-even"><td><p>datafusion.explain.physical_plan_only</p></td> <td><p>false</p></td> <td><p>When set to true, the explain statement will only print physical plans</p></td> </tr> -<tr class="row-even"><td><p>datafusion.explain.show_statistics</p></td> +<tr class="row-odd"><td><p>datafusion.explain.show_statistics</p></td> <td><p>false</p></td> <td><p>When set to true, the explain statement will print operator statistics for physical plans</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.explain.show_sizes</p></td> +<tr class="row-even"><td><p>datafusion.explain.show_sizes</p></td> <td><p>true</p></td> <td><p>When set to true, the explain statement will print the partition sizes</p></td> </tr> -<tr class="row-even"><td><p>datafusion.explain.show_schema</p></td> +<tr class="row-odd"><td><p>datafusion.explain.show_schema</p></td> <td><p>false</p></td> <td><p>When set to true, the explain statement will print schema information</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.explain.format</p></td> +<tr class="row-even"><td><p>datafusion.explain.format</p></td> <td><p>indent</p></td> <td><p>Display format of explain. Default is “indent”. When set to “tree”, it will print the plan in a tree-rendered format.</p></td> </tr> -<tr class="row-even"><td><p>datafusion.explain.tree_maximum_render_width</p></td> +<tr class="row-odd"><td><p>datafusion.explain.tree_maximum_render_width</p></td> <td><p>240</p></td> <td><p>(format=tree only) Maximum total width of the rendered tree. 
When set to 0, the tree will have no width limit.</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.sql_parser.parse_float_as_decimal</p></td> +<tr class="row-even"><td><p>datafusion.sql_parser.parse_float_as_decimal</p></td> <td><p>false</p></td> <td><p>When set to true, SQL parser will parse float as decimal type</p></td> </tr> -<tr class="row-even"><td><p>datafusion.sql_parser.enable_ident_normalization</p></td> +<tr class="row-odd"><td><p>datafusion.sql_parser.enable_ident_normalization</p></td> <td><p>true</p></td> <td><p>When set to true, SQL parser will normalize ident (convert ident to lowercase when not quoted)</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.sql_parser.enable_options_value_normalization</p></td> +<tr class="row-even"><td><p>datafusion.sql_parser.enable_options_value_normalization</p></td> <td><p>false</p></td> <td><p>When set to true, SQL parser will normalize options value (convert value to lowercase). Note that this option is ignored and will be removed in the future. All case-insensitive values are normalized automatically.</p></td> </tr> -<tr class="row-even"><td><p>datafusion.sql_parser.dialect</p></td> +<tr class="row-odd"><td><p>datafusion.sql_parser.dialect</p></td> <td><p>generic</p></td> <td><p>Configure the SQL dialect used by DataFusion’s parser; supported values include: Generic, MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, MsSQL, ClickHouse, BigQuery, Ansi, DuckDB and Databricks.</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.sql_parser.support_varchar_with_length</p></td> +<tr class="row-even"><td><p>datafusion.sql_parser.support_varchar_with_length</p></td> <td><p>true</p></td> <td><p>If true, permit lengths for <code class="docutils literal notranslate"><span class="pre">VARCHAR</span></code> such as <code class="docutils literal notranslate"><span class="pre">VARCHAR(20)</span></code>, but ignore the length. If false, error if a <code class="docutils literal notranslate"><span class="pre">VARCHAR</span></code> with a length is specified. The Arrow type system does not have a notion of maximum string length and thus DataFusion can not enforce such limits.</p></td> </tr> -<tr class="row-even"><td><p>datafusion.sql_parser.map_string_types_to_utf8view</p></td> +<tr class="row-odd"><td><p>datafusion.sql_parser.map_string_types_to_utf8view</p></td> <td><p>true</p></td> <td><p>If true, string types (VARCHAR, CHAR, Text, and String) are mapped to <code class="docutils literal notranslate"><span class="pre">Utf8View</span></code> during SQL planning. If false, they are mapped to <code class="docutils literal notranslate"><span class="pre">Utf8</span></code>. Default is true.</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.sql_parser.collect_spans</p></td> +<tr class="row-even"><td><p>datafusion.sql_parser.collect_spans</p></td> <td><p>false</p></td> <td><p>When set to true, the source locations relative to the original SQL query (i.e. 
<a class="reference external" href="https://docs.rs/sqlparser/latest/sqlparser/tokenizer/struct.Span.html"><code class="docutils literal notranslate"><span class="pre">Span</span></code></a>) will be collected and recorded in the logical plan nodes.</p></td> </tr> -<tr class="row-even"><td><p>datafusion.sql_parser.recursion_limit</p></td> +<tr class="row-odd"><td><p>datafusion.sql_parser.recursion_limit</p></td> <td><p>50</p></td> <td><p>Specifies the recursion depth limit when parsing complex SQL Queries</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.sql_parser.default_null_ordering</p></td> +<tr class="row-even"><td><p>datafusion.sql_parser.default_null_ordering</p></td> <td><p>nulls_max</p></td> <td><p>Specifies the default null ordering for query results. There are 4 options: - <code class="docutils literal notranslate"><span class="pre">nulls_max</span></code>: Nulls appear last in ascending order. - <code class="docutils literal notranslate"><span class="pre">nulls_min</span></code>: Nulls appear first in ascending order. - <code class="docutils literal notranslate"><span class="pre">nulls_first</span></code>: Nulls always be first in any order. - <code class="docutils litera [...] </tr> -<tr class="row-even"><td><p>datafusion.format.safe</p></td> +<tr class="row-odd"><td><p>datafusion.format.safe</p></td> <td><p>true</p></td> <td><p>If set to <code class="docutils literal notranslate"><span class="pre">true</span></code> any formatting errors will be written to the output instead of being converted into a [<code class="docutils literal notranslate"><span class="pre">std::fmt::Error</span></code>]</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.format.null</p></td> +<tr class="row-even"><td><p>datafusion.format.null</p></td> <td><p></p></td> <td><p>Format string for nulls</p></td> </tr> -<tr class="row-even"><td><p>datafusion.format.date_format</p></td> +<tr class="row-odd"><td><p>datafusion.format.date_format</p></td> <td><p>%Y-%m-%d</p></td> <td><p>Date format for date arrays</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.format.datetime_format</p></td> +<tr class="row-even"><td><p>datafusion.format.datetime_format</p></td> <td><p>%Y-%m-%dT%H:%M:%S%.f</p></td> <td><p>Format for DateTime arrays</p></td> </tr> -<tr class="row-even"><td><p>datafusion.format.timestamp_format</p></td> +<tr class="row-odd"><td><p>datafusion.format.timestamp_format</p></td> <td><p>%Y-%m-%dT%H:%M:%S%.f</p></td> <td><p>Timestamp format for timestamp arrays</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.format.timestamp_tz_format</p></td> +<tr class="row-even"><td><p>datafusion.format.timestamp_tz_format</p></td> <td><p>NULL</p></td> <td><p>Timestamp format for timestamp with timezone arrays. When <code class="docutils literal notranslate"><span class="pre">None</span></code>, ISO 8601 format is used.</p></td> </tr> -<tr class="row-even"><td><p>datafusion.format.time_format</p></td> +<tr class="row-odd"><td><p>datafusion.format.time_format</p></td> <td><p>%H:%M:%S%.f</p></td> <td><p>Time format for time arrays</p></td> </tr> -<tr class="row-odd"><td><p>datafusion.format.duration_format</p></td> +<tr class="row-even"><td><p>datafusion.format.duration_format</p></td> <td><p>pretty</p></td> <td><p>Duration format. 
Can be either <code class="docutils literal notranslate"><span class="pre">"pretty"</span></code> or <code class="docutils literal notranslate"><span class="pre">"ISO8601"</span></code></p></td> </tr> -<tr class="row-even"><td><p>datafusion.format.types_info</p></td> +<tr class="row-odd"><td><p>datafusion.format.types_info</p></td> <td><p>false</p></td> <td><p>Show types in visual representation batches</p></td> </tr> diff --git a/user-guide/sql/ddl.html b/user-guide/sql/ddl.html index 0f6a32df3a..98c3880001 100644 --- a/user-guide/sql/ddl.html +++ b/user-guide/sql/ddl.html @@ -840,6 +840,31 @@ table (multiple files with the same schema)</p> <span class="k">OPTIONS</span><span class="w"> </span><span class="p">(</span><span class="s1">'has_header'</span><span class="w"> </span><span class="s1">'true'</span><span class="p">);</span> </pre></div> </div> +<p>Tables that are partitioned using a Hive compliant partitioning scheme will have their columns and values automatically +detected and incorporated into the table’s schema and data. Given the following example directory structure:</p> +<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hive_partitioned/</span> +<span class="go">├── a=1</span> +<span class="go">│ └── b=200</span> +<span class="go">│ └── file1.parquet</span> +<span class="go">└── a=2</span> +<span class="go"> └── b=100</span> +<span class="go"> └── file2.parquet</span> +</pre></div> +</div> +<p>Users can specify the top level <code class="docutils literal notranslate"><span class="pre">hive_partitioned</span></code> directory as an <code class="docutils literal notranslate"><span class="pre">EXTERNAL</span> <span class="pre">TABLE</span></code> and leverage the Hive partitions to query +and filter data.</p> +<div class="highlight-sql notranslate"><div class="highlight"><pre><span></span><span class="k">CREATE</span><span class="w"> </span><span class="k">EXTERNAL</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">hive_partitioned</span> +<span class="n">STORED</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">PARQUET</span> +<span class="k">LOCATION</span><span class="w"> </span><span class="s1">'/path/to/hive_partitioned/'</span><span class="p">;</span> + +<span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">hive_partitioned</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">b</span><span class="o">=</span><span class="mi">100</span><span class="p">;</span> +<span class="o">+</span><span class="c1">------------------+</span> +<span class="o">|</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="o">|</span> +<span class="o">+</span><span class="c1">------------------+</span> +<span class="o">|</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o">|</span> +<span class="o">+</span><span class="c1">------------------+</span> +</pre></div> +</div> </section> <section id="example-unbounded-data-sources"> <h3>Example: Unbounded Data Sources<a class="headerlink" href="#example-unbounded-data-sources" title="Link to this heading">¶</a></h3> 