This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 0b95c13b04 Publish built docs triggered by 1dcdcd431187178d736cdd3a6c004204aa2faa14
0b95c13b04 is described below
commit 0b95c13b04d2c3248a03130f135c3e2282384ea3
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AuthorDate: Sun Jan 14 19:36:11 2024 +0000
Publish built docs triggered by 1dcdcd431187178d736cdd3a6c004204aa2faa14
---
_sources/user-guide/cli.md.txt | 69 +++++++++++++++++++-
searchindex.js | 2 +-
user-guide/cli.html | 142 ++++++++++++++++++++++++++++++++++++++++-
3 files changed, 209 insertions(+), 4 deletions(-)
diff --git a/_sources/user-guide/cli.md.txt b/_sources/user-guide/cli.md.txt
index 525ab090ce..95b3e7125c 100644
--- a/_sources/user-guide/cli.md.txt
+++ b/_sources/user-guide/cli.md.txt
@@ -191,7 +191,7 @@ DataFusion CLI v16.0.0
2 rows in set. Query took 0.007 seconds.
```
-## Creating external tables
+## Creating External Tables
It is also possible to create a table backed by files explicitly
via `CREATE EXTERNAL TABLE` as shown below. Filemask wildcards are supported.
@@ -425,6 +425,13 @@ Available commands inside DataFusion CLI are:
> \h function
```
+## Supported SQL
+
+In addition to the normal [SQL supported in DataFusion], `datafusion-cli` also
+supports additional statements and commands:
+
+[sql supported in datafusion]: sql/index.rst
+
- Show configuration options
`SHOW ALL [VERBOSE]`
@@ -467,6 +474,66 @@ Available commands inside DataFusion CLI are:
> SET datafusion.execution.batch_size to 1024;
```
+- `parquet_metadata` table function
+
+The `parquet_metadata` table function can be used to inspect detailed metadata
+about a parquet file such as statistics, sizes, and other information. This can
+be helpful to understand how parquet files are structured.
+
+For example, to see information about the `"WatchID"` column in the
+`hits.parquet` file, you can use:
+
+```sql
+SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
+FROM parquet_metadata('hits.parquet')
+WHERE path_in_schema = '"WatchID"'
+LIMIT 3;
+
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
+| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
+| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+3 rows in set. Query took 0.053 seconds.
+```
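The per-row-group counts in output like this can also be aggregated offline. As a minimal illustrative sketch (plain Python, not part of `datafusion-cli`; the values are copied from the example output above):

```python
# Sum the row_group_num_rows values from the three "WatchID" column
# chunks shown in the example output above. Since each row group has
# exactly one chunk per column, this is the total number of rows those
# three row groups contribute to the file.
row_group_num_rows = [450560, 612174, 344064]

total_rows = sum(row_group_num_rows)
print(total_rows)  # 1406798
```

The same aggregation should also be expressible directly in SQL over `parquet_metadata` with `sum(row_group_num_rows)`.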
+
+The returned table has one row for each column chunk in the file, with the
+following columns. Please refer to the [Parquet Documentation] for more information.
+
+[parquet documentation]: https://parquet.apache.org/
+
+| column_name             | data_type | Description                                                                                            |
+| ----------------------- | --------- | ------------------------------------------------------------------------------------------------------ |
+| filename                | Utf8      | Name of the file                                                                                       |
+| row_group_id            | Int64     | Row group index the column chunk belongs to                                                            |
+| row_group_num_rows      | Int64     | Count of rows stored in the row group                                                                  |
+| row_group_num_columns   | Int64     | Total number of columns in the row group (same for all row groups)                                     |
+| row_group_bytes         | Int64     | Number of bytes used to store the row group (not including metadata)                                   |
+| column_id               | Int64     | ID of the column                                                                                       |
+| file_offset             | Int64     | Offset within the file that this column chunk's data begins                                            |
+| num_values              | Int64     | Total number of values in this column chunk                                                            |
+| path_in_schema          | Utf8      | "Path" (column name) of the column chunk in the schema                                                 |
+| type                    | Utf8      | Parquet data type of the column chunk                                                                  |
+| stats_min               | Utf8      | The minimum value for this column chunk, if stored in the statistics, cast to a string                 |
+| stats_max               | Utf8      | The maximum value for this column chunk, if stored in the statistics, cast to a string                 |
+| stats_null_count        | Int64     | Number of null values in this column chunk, if stored in the statistics                                |
+| stats_distinct_count    | Int64     | Number of distinct values in this column chunk, if stored in the statistics                            |
+| stats_min_value         | Utf8      | Same as `stats_min`                                                                                    |
+| stats_max_value         | Utf8      | Same as `stats_max`                                                                                    |
+| compression             | Utf8      | Block level compression (e.g. `SNAPPY`) used for this column chunk                                     |
+| encodings               | Utf8      | All block level encodings (e.g. `[PLAIN_DICTIONARY, PLAIN, RLE]`) used for this column chunk           |
+| index_page_offset       | Int64     | Offset in the file of the [`page index`], if any                                                       |
+| dictionary_page_offset  | Int64     | Offset in the file of the dictionary page, if any                                                      |
+| data_page_offset        | Int64     | Offset in the file of the first data page, if any                                                      |
+| total_compressed_size   | Int64     | Number of bytes of the column chunk's data after encoding and compression (what is stored in the file) |
+| total_uncompressed_size | Int64     | Number of bytes of the column chunk's data after encoding                                              |
+
+[`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
+
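Because the table exposes both `total_compressed_size` and `total_uncompressed_size`, the compression ratio of each column chunk can be derived from it. A minimal sketch in plain Python (the byte counts below are hypothetical example values, not taken from `hits.parquet`):

```python
# Derive a per-column-chunk compression ratio from the two size columns
# described in the table above. The byte counts passed in are made-up
# example values, not real hits.parquet numbers.
def compression_ratio(total_uncompressed_size: int, total_compressed_size: int) -> float:
    """Return uncompressed/compressed bytes; > 1.0 means the codec saved space."""
    if total_compressed_size <= 0:
        raise ValueError("total_compressed_size must be positive")
    return total_uncompressed_size / total_compressed_size

print(round(compression_ratio(10_000_000, 3_883_759), 2))  # 2.57
```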
## Changing Configuration Options
All available configuration options can be seen using `SHOW ALL` as described
above.
diff --git a/searchindex.js b/searchindex.js
index 886c387b85..1a8eeda62d 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"docnames": ["contributor-guide/architecture",
"contributor-guide/communication", "contributor-guide/index",
"contributor-guide/quarterly_roadmap", "contributor-guide/roadmap",
"contributor-guide/specification/index",
"contributor-guide/specification/invariants",
"contributor-guide/specification/output-field-name-semantic", "index",
"library-user-guide/adding-udfs", "library-user-guide/building-logical-plans",
"library-user-guide/catalogs", "library-user-guide/custom-tab [...]
\ No newline at end of file
+Search.setIndex({"docnames": ["contributor-guide/architecture",
"contributor-guide/communication", "contributor-guide/index",
"contributor-guide/quarterly_roadmap", "contributor-guide/roadmap",
"contributor-guide/specification/index",
"contributor-guide/specification/invariants",
"contributor-guide/specification/output-field-name-semantic", "index",
"library-user-guide/adding-udfs", "library-user-guide/building-logical-plans",
"library-user-guide/catalogs", "library-user-guide/custom-tab [...]
\ No newline at end of file
diff --git a/user-guide/cli.html b/user-guide/cli.html
index 3094df89b2..a1a412b756 100644
--- a/user-guide/cli.html
+++ b/user-guide/cli.html
@@ -396,7 +396,7 @@
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#creating-external-tables">
- Creating external tables
+ Creating External Tables
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
@@ -429,6 +429,11 @@
Commands
</a>
</li>
+ <li class="toc-h2 nav-item toc-entry">
+ <a class="reference internal nav-link" href="#supported-sql">
+ Supported SQL
+ </a>
+ </li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link"
href="#changing-configuration-options">
Changing Configuration Options
@@ -642,7 +647,7 @@ DataFusion<span class="w"> </span>CLI<span class="w">
</span>v16.0.0
</div>
</section>
<section id="creating-external-tables">
-<h2>Creating external tables<a class="headerlink"
href="#creating-external-tables" title="Link to this heading">¶</a></h2>
+<h2>Creating External Tables<a class="headerlink"
href="#creating-external-tables" title="Link to this heading">¶</a></h2>
<p>It is also possible to create a table backed by files explicitly
via <code class="docutils literal notranslate"><span class="pre">CREATE</span>
<span class="pre">EXTERNAL</span> <span class="pre">TABLE</span></code> as
shown below. Filemask wildcards are supported.</p>
</section>
@@ -857,6 +862,11 @@ DataFusion<span class="w"> </span>CLI<span class="w">
</span>v21.0.0
<div class="highlight-bash notranslate"><div
class="highlight"><pre><span></span>><span class="w"> </span><span
class="se">\h</span><span class="w"> </span><span class="k">function</span>
</pre></div>
</div>
+</section>
+<section id="supported-sql">
+<h2>Supported SQL<a class="headerlink" href="#supported-sql" title="Link to
this heading">¶</a></h2>
+<p>In addition to the normal <a class="reference internal"
href="sql/index.html"><span class="std std-doc">SQL supported in
DataFusion</span></a>, <code class="docutils literal notranslate"><span
class="pre">datafusion-cli</span></code> also
+supports additional statements and commands:</p>
<ul class="simple">
<li><p>Show configuration options</p></li>
</ul>
@@ -895,6 +905,134 @@ DataFusion<span class="w"> </span>CLI<span class="w">
</span>v21.0.0
<div class="highlight-SQL notranslate"><div
class="highlight"><pre><span></span><span class="o">></span><span class="w">
</span><span class="k">SET</span><span class="w"> </span><span
class="n">datafusion</span><span class="p">.</span><span
class="n">execution</span><span class="p">.</span><span
class="n">batch_size</span><span class="w"> </span><span
class="k">to</span><span class="w"> </span><span class="mi">1024</span><span
class="p">;</span>
</pre></div>
</div>
+<ul class="simple">
+<li><p><code class="docutils literal notranslate"><span
class="pre">parquet_metadata</span></code> table function</p></li>
+</ul>
+<p>The <code class="docutils literal notranslate"><span
class="pre">parquet_metadata</span></code> table function can be used to
inspect detailed metadata
+about a parquet file such as statistics, sizes, and other information. This can
+be helpful to understand how parquet files are structured.</p>
+<p>For example, to see information about the <code class="docutils literal
notranslate"><span class="pre">"WatchID"</span></code> column in the
+<code class="docutils literal notranslate"><span
class="pre">hits.parquet</span></code> file, you can use:</p>
+<div class="highlight-sql notranslate"><div
class="highlight"><pre><span></span><span class="k">SELECT</span><span
class="w"> </span><span class="n">path_in_schema</span><span
class="p">,</span><span class="w"> </span><span
class="n">row_group_id</span><span class="p">,</span><span class="w">
</span><span class="n">row_group_num_rows</span><span class="p">,</span><span
class="w"> </span><span class="n">stats_min</span><span class="p">,</span><span
class="w"> </span><span class="n">stats_ [...]
+<span class="k">FROM</span><span class="w"> </span><span
class="n">parquet_metadata</span><span class="p">(</span><span
class="s1">'hits.parquet'</span><span class="p">)</span>
+<span class="k">WHERE</span><span class="w"> </span><span
class="n">path_in_schema</span><span class="w"> </span><span
class="o">=</span><span class="w"> </span><span
class="s1">'"WatchID"'</span>
+<span class="k">LIMIT</span><span class="w"> </span><span
class="mi">3</span><span class="p">;</span>
+
+<span class="o">+</span><span
class="c1">----------------+--------------+--------------------+---------------------+---------------------+-----------------------+</span>
+<span class="o">|</span><span class="w"> </span><span
class="n">path_in_schema</span><span class="w"> </span><span
class="o">|</span><span class="w"> </span><span
class="n">row_group_id</span><span class="w"> </span><span
class="o">|</span><span class="w"> </span><span
class="n">row_group_num_rows</span><span class="w"> </span><span
class="o">|</span><span class="w"> </span><span class="n">stats_min</span><span
class="w"> </span><span class="o">|</span><span class="w"> </span><
[...]
+<span class="o">+</span><span
class="c1">----------------+--------------+--------------------+---------------------+---------------------+-----------------------+</span>
+<span class="o">|</span><span class="w"> </span><span
class="ss">"WatchID"</span><span class="w"> </span><span
class="o">|</span><span class="w"> </span><span class="mi">0</span><span
class="w"> </span><span class="o">|</span><span class="w">
</span><span class="mi">450560</span><span class="w"> </span><span
class="o">|</span><span class="w"> </span><span
class="mi">4611687214012840539</span><span class="w"> </span><span
class="o">|</span><span class [...]
+<span class="o">|</span><span class="w"> </span><span
class="ss">"WatchID"</span><span class="w"> </span><span
class="o">|</span><span class="w"> </span><span class="mi">1</span><span
class="w"> </span><span class="o">|</span><span class="w">
</span><span class="mi">612174</span><span class="w"> </span><span
class="o">|</span><span class="w"> </span><span
class="mi">4611689135232456464</span><span class="w"> </span><span
class="o">|</span><span class [...]
+<span class="o">|</span><span class="w"> </span><span
class="ss">"WatchID"</span><span class="w"> </span><span
class="o">|</span><span class="w"> </span><span class="mi">2</span><span
class="w"> </span><span class="o">|</span><span class="w">
</span><span class="mi">344064</span><span class="w"> </span><span
class="o">|</span><span class="w"> </span><span
class="mi">4611692774829951781</span><span class="w"> </span><span
class="o">|</span><span class [...]
+<span class="o">+</span><span
class="c1">----------------+--------------+--------------------+---------------------+---------------------+-----------------------+</span>
+<span class="mi">3</span><span class="w"> </span><span
class="k">rows</span><span class="w"> </span><span class="k">in</span><span
class="w"> </span><span class="k">set</span><span class="p">.</span><span
class="w"> </span><span class="n">Query</span><span class="w"> </span><span
class="n">took</span><span class="w"> </span><span class="mi">0</span><span
class="p">.</span><span class="mi">053</span><span class="w"> </span><span
class="n">seconds</span><span class="p">.</span>
+</pre></div>
+</div>
+<p>The returned table has the following columns for each row for each column
chunk
+in the file. Please refer to the <a class="reference external"
href="https://parquet.apache.org/">Parquet Documentation</a> for more
information.</p>
+<table class="table">
+<thead>
+<tr class="row-odd"><th class="head"><p>column_name</p></th>
+<th class="head"><p>data_type</p></th>
+<th class="head"><p>Description</p></th>
+</tr>
+</thead>
+<tbody>
+<tr class="row-even"><td><p>filename</p></td>
+<td><p>Utf8</p></td>
+<td><p>Name of the file</p></td>
+</tr>
+<tr class="row-odd"><td><p>row_group_id</p></td>
+<td><p>Int64</p></td>
+<td><p>Row group index the column chunk belongs to</p></td>
+</tr>
+<tr class="row-even"><td><p>row_group_num_rows</p></td>
+<td><p>Int64</p></td>
+<td><p>Count of rows stored in the row group</p></td>
+</tr>
+<tr class="row-odd"><td><p>row_group_num_columns</p></td>
+<td><p>Int64</p></td>
+<td><p>Total number of columns in the row group (same for all row
groups)</p></td>
+</tr>
+<tr class="row-even"><td><p>row_group_bytes</p></td>
+<td><p>Int64</p></td>
+<td><p>Number of bytes used to store the row group (not including
metadata)</p></td>
+</tr>
+<tr class="row-odd"><td><p>column_id</p></td>
+<td><p>Int64</p></td>
+<td><p>ID of the column</p></td>
+</tr>
+<tr class="row-even"><td><p>file_offset</p></td>
+<td><p>Int64</p></td>
+<td><p>Offset within the file that this column chunk’s data begins</p></td>
+</tr>
+<tr class="row-odd"><td><p>num_values</p></td>
+<td><p>Int64</p></td>
+<td><p>Total number of values in this column chunk</p></td>
+</tr>
+<tr class="row-even"><td><p>path_in_schema</p></td>
+<td><p>Utf8</p></td>
+<td><p>“Path” (column name) of the column chunk in the schema</p></td>
+</tr>
+<tr class="row-odd"><td><p>type</p></td>
+<td><p>Utf8</p></td>
+<td><p>Parquet data type of the column chunk</p></td>
+</tr>
+<tr class="row-even"><td><p>stats_min</p></td>
+<td><p>Utf8</p></td>
+<td><p>The minimum value for this column chunk, if stored in the statistics,
cast to a string</p></td>
+</tr>
+<tr class="row-odd"><td><p>stats_max</p></td>
+<td><p>Utf8</p></td>
+<td><p>The maximum value for this column chunk, if stored in the statistics,
cast to a string</p></td>
+</tr>
+<tr class="row-even"><td><p>stats_null_count</p></td>
+<td><p>Int64</p></td>
+<td><p>Number of null values in this column chunk, if stored in the
statistics</p></td>
+</tr>
+<tr class="row-odd"><td><p>stats_distinct_count</p></td>
+<td><p>Int64</p></td>
+<td><p>Number of distinct values in this column chunk, if stored in the
statistics</p></td>
+</tr>
+<tr class="row-even"><td><p>stats_min_value</p></td>
+<td><p>Utf8</p></td>
+<td><p>Same as <code class="docutils literal notranslate"><span
class="pre">stats_min</span></code></p></td>
+</tr>
+<tr class="row-odd"><td><p>stats_max_value</p></td>
+<td><p>Utf8</p></td>
+<td><p>Same as <code class="docutils literal notranslate"><span
class="pre">stats_max</span></code></p></td>
+</tr>
+<tr class="row-even"><td><p>compression</p></td>
+<td><p>Utf8</p></td>
+<td><p>Block level compression (e.g. <code class="docutils literal
notranslate"><span class="pre">SNAPPY</span></code>) used for this column
chunk</p></td>
+</tr>
+<tr class="row-odd"><td><p>encodings</p></td>
+<td><p>Utf8</p></td>
+<td><p>All block level encodings (e.g. <code class="docutils literal
notranslate"><span class="pre">[PLAIN_DICTIONARY,</span> <span
class="pre">PLAIN,</span> <span class="pre">RLE]</span></code>) used for this
column chunk</p></td>
+</tr>
+<tr class="row-even"><td><p>index_page_offset</p></td>
+<td><p>Int64</p></td>
+<td><p>Offset in the file of the <a class="reference external"
href="https://github.com/apache/parquet-format/blob/master/PageIndex.md"><code
class="docutils literal notranslate"><span class="pre">page</span> <span
class="pre">index</span></code></a>, if any</p></td>
+</tr>
+<tr class="row-odd"><td><p>dictionary_page_offset</p></td>
+<td><p>Int64</p></td>
+<td><p>Offset in the file of the dictionary page, if any</p></td>
+</tr>
+<tr class="row-even"><td><p>data_page_offset</p></td>
+<td><p>Int64</p></td>
+<td><p>Offset in the file of the first data page, if any</p></td>
+</tr>
+<tr class="row-odd"><td><p>total_compressed_size</p></td>
+<td><p>Int64</p></td>
+<td><p>Number of bytes of the column chunk’s data after encoding and compression
(what is stored in the file)</p></td>
+</tr>
+<tr class="row-even"><td><p>total_uncompressed_size</p></td>
+<td><p>Int64</p></td>
+<td><p>Number of bytes of the column chunk’s data after encoding</p></td>
+</tr>
+</tbody>
+</table>
</section>
<section id="changing-configuration-options">
<h2>Changing Configuration Options<a class="headerlink"
href="#changing-configuration-options" title="Link to this heading">¶</a></h2>