This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new 0573f78c7e Update datafusion-cli docs, split up (#10078)
0573f78c7e is described below
commit 0573f78c7e7a4d94c3204cee464b3860479e0afb
Author: Andrew Lamb <[email protected]>
AuthorDate: Tue Apr 16 09:58:16 2024 -0400
Update datafusion-cli docs, split up (#10078)
* Update datafusion-cli docs, split up
* remove PyPI instructions
---
datafusion-cli/README.md | 8 +-
docs/source/index.rst | 2 +-
docs/source/user-guide/cli.md | 660 -----------------------------
docs/source/user-guide/cli/datasources.md | 349 +++++++++++++++
docs/source/user-guide/cli/index.rst | 27 ++
docs/source/user-guide/cli/installation.md | 63 +++
docs/source/user-guide/cli/overview.md | 45 ++
docs/source/user-guide/cli/usage.md | 288 +++++++++++++
8 files changed, 777 insertions(+), 665 deletions(-)
diff --git a/datafusion-cli/README.md b/datafusion-cli/README.md
index 0afcd489f7..2aae6bfc30 100644
--- a/datafusion-cli/README.md
+++ b/datafusion-cli/README.md
@@ -23,20 +23,20 @@
[DataFusion](https://arrow.apache.org/datafusion/) is an extensible query
execution framework, written in Rust, that uses Apache Arrow as its in-memory
format.
-The DataFusion CLI is a command line utility that runs SQL queries using the DataFusion engine.
+DataFusion CLI (`datafusion-cli`) is a small command line utility that runs SQL queries using the DataFusion engine.
# Frequently Asked Questions
## Where can I find more information?
-Answer: See the [`datafusion-cli` documentation](https://arrow.apache.org/datafusion/user-guide/cli.html) for further information.
+See the [`datafusion-cli` documentation](https://arrow.apache.org/datafusion/user-guide/cli.html) for further information.
## How do I make my IDE work with `datafusion-cli`?
-Answer: "open" the `datafusion/datafusion-cli` project as its own top level
+"open" the `datafusion/datafusion-cli` project as its own top level
project in my IDE (rather than opening `datafusion`)
-The reason `datafusion-cli` is not listed as part of the workspace in the main
+The reason `datafusion-cli` is not part of the main workspace in
[`datafusion Cargo.toml`] file is that `datafusion-cli` is a binary and has a
checked in `Cargo.lock` file to ensure reproducible builds.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 919a7ad703..8d8d5f7ef0 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -67,7 +67,7 @@ Please see the `developer’s guide`_ for contributing and `communication`_ for
user-guide/introduction
user-guide/example-usage
- user-guide/cli
+ user-guide/cli/index
user-guide/dataframe
user-guide/expressions
user-guide/sql/index
diff --git a/docs/source/user-guide/cli.md b/docs/source/user-guide/cli.md
deleted file mode 100644
index 9c3fc8bd60..0000000000
--- a/docs/source/user-guide/cli.md
+++ /dev/null
@@ -1,660 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# Command line SQL console
-
-The DataFusion CLI is a command-line interactive SQL utility for executing
-queries against any supported data files. It is a convenient way to
-try DataFusion's SQL support with your own data.
-
-## Installation
-
-### Install and run using Cargo
-
-The easiest way to install DataFusion CLI a spin is via `cargo install datafusion-cli`.
-
-### Install and run using Homebrew (on MacOS)
-
-DataFusion CLI can also be installed via Homebrew (on MacOS). If you don't have Homebrew installed, you can check how to install it [here](https://docs.brew.sh/Installation).
-
-Install it as any other pre-built software like this:
-
-```bash
-brew install datafusion
-# ==> Downloading https://ghcr.io/v2/homebrew/core/datafusion/manifests/12.0.0
-# ######################################################################## 100.0%
-# ==> Downloading https://ghcr.io/v2/homebrew/core/datafusion/blobs/sha256:9ecc8a01be47ceb9a53b39976696afa87c0a8
-# ==> Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:9ecc8a01be47ceb9a53b39976
-# ######################################################################## 100.0%
-# ==> Pouring datafusion--12.0.0.big_sur.bottle.tar.gz
-# 🍺 /usr/local/Cellar/datafusion/12.0.0: 9 files, 17.4MB
-
-datafusion-cli
-```
-
-### Install and run using PyPI
-
-DataFusion CLI can also be installed via PyPI. You can check how to install PyPI [here](https://pip.pypa.io/en/latest/installation/).
-
-Install it as any other pre-built software like this:
-
-```bash
-pip3 install datafusion
-# Defaulting to user installation because normal site-packages is not writeable
-# Collecting datafusion
-# Downloading datafusion-33.0.0-cp38-abi3-macosx_11_0_arm64.whl.metadata (9.6 kB)
-# Collecting pyarrow>=11.0.0 (from datafusion)
-# Downloading pyarrow-14.0.1-cp39-cp39-macosx_11_0_arm64.whl.metadata (3.0 kB)
-# Requirement already satisfied: numpy>=1.16.6 in /Users/Library/Python/3.9/lib/python/site-packages (from pyarrow>=11.0.0->datafusion) (1.23.4)
-# Downloading datafusion-33.0.0-cp38-abi3-macosx_11_0_arm64.whl (13.5 MB)
-# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.5/13.5 MB 3.6 MB/s eta 0:00:00
-# Downloading pyarrow-14.0.1-cp39-cp39-macosx_11_0_arm64.whl (24.0 MB)
-# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.0/24.0 MB 36.4 MB/s eta 0:00:00
-# Installing collected packages: pyarrow, datafusion
-# Attempting uninstall: pyarrow
-# Found existing installation: pyarrow 10.0.1
-# Uninstalling pyarrow-10.0.1:
-# Successfully uninstalled pyarrow-10.0.1
-# Successfully installed datafusion-33.0.0 pyarrow-14.0.1
-
-datafusion-cli
-```
-
-### Run using Docker
-
-There is no officially published Docker image for the DataFusion CLI, so it is necessary to build from source
-instead.
-
-Use the following commands to clone this repository and build a Docker image containing the CLI tool. Note
-that there is `.dockerignore` file in the root of the repository that may need to be deleted in order for
-this to work.
-
-```bash
-git clone https://github.com/apache/arrow-datafusion
-cd arrow-datafusion
-git checkout 12.0.0
-docker build -f datafusion-cli/Dockerfile . --tag datafusion-cli
-docker run -it -v $(your_data_location):/data datafusion-cli
-```
-
-## Usage
-
-See the current usage using `datafusion-cli --help`:
-
-```bash
-Apache Arrow <[email protected]>
-Command Line Client for DataFusion query engine.
-
-USAGE:
- datafusion-cli [OPTIONS]
-
-OPTIONS:
- -b, --batch-size <BATCH_SIZE>
- The batch size of each query, or use DataFusion default
-
- -c, --command <COMMAND>...
- Execute the given command string(s), then exit
-
- --color
- Enables console syntax highlighting
-
- -f, --file <FILE>...
- Execute commands from file(s), then exit
-
- --format <FORMAT>
- [default: table] [possible values: csv, tsv, table, json, nd-json]
-
- -h, --help
- Print help information
-
- -m, --memory-limit <MEMORY_LIMIT>
- The memory pool limitation (e.g. '10g'), default to None (no limit)
-
- --maxrows <MAXROWS>
- The max number of rows to display for 'Table' format
- [default: 40] [possible values: numbers(0/10/...), inf(no limit)]
-
- --mem-pool-type <MEM_POOL_TYPE>
- Specify the memory pool type 'greedy' or 'fair', default to 'greedy'
-
- -p, --data-path <DATA_PATH>
- Path to your data, default to current directory
-
- -q, --quiet
- Reduce printing other than the results and work quietly
-
- -r, --rc <RC>...
- Run the provided files on startup instead of ~/.datafusionrc
-
- -V, --version
- Print version information
-```
-
-## Querying data from the files directly
-
-Files can be queried directly by enclosing the file, directory name
-or a remote location in single `'` quotes as shown in the examples.
-
-## Examples
-
-Create a CSV file to query.
-
-```shell
-$ echo "a,b" > data.csv
-$ echo "1,2" >> data.csv
-```
-
-Query that single file (the CLI also supports parquet, compressed csv, avro, json and more)
-
-```shell
-$ datafusion-cli
-DataFusion CLI v17.0.0
-> select * from 'data.csv';
-+---+---+
-| a | b |
-+---+---+
-| 1 | 2 |
-+---+---+
-1 row in set. Query took 0.007 seconds.
-```
-
-You can also query directories of files with compatible schemas:
-
-```shell
-$ ls data_dir/
-data.csv data2.csv
-```
-
-```shell
-$ datafusion-cli
-DataFusion CLI v16.0.0
-> select * from 'data_dir';
-+---+---+
-| a | b |
-+---+---+
-| 3 | 4 |
-| 1 | 2 |
-+---+---+
-2 rows in set. Query took 0.007 seconds.
-```
-
-You can also query directly from any remote location supported by DataFusion without
-registering the location as a table.
-For example, to read from a remote parquet file via HTTP(S) you can use the following:
-
-```sql
-select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'
-+----------+
-| COUNT(*) |
-+----------+
-| 1000000 |
-+----------+
-1 row in set. Query took 0.595 seconds.
-```
-
-To read from an AWS S3 or GCS, use `s3` or `gs` as a protocol prefix. For example, this will read a file
-in S3 bucket named `my-data-bucket`. Note that this is not a real file location and therefore the query
-will fail, you need to use your own file location in S3. Also, you need to set the relevent access credentials
-as environmental variables (e.g. for AWS S3 you need to at least `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`)
-
-```sql
-select count(*) from 's3://my-data-bucket/athena_partitioned/hits.parquet'
-```
-
-## Creating External Tables
-
-It is also possible to create a table backed by files by explicitly
-via `CREATE EXTERNAL TABLE` as shown below. Filemask wildcards supported
-
-## Registering Parquet Data Sources
-
-Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. The schema information will be derived automatically.
-
-Register a single file parquet datasource
-
-```sql
-CREATE EXTERNAL TABLE taxi
-STORED AS PARQUET
-LOCATION '/mnt/nyctaxi/tripdata.parquet';
-```
-
-Register a single folder parquet datasource. All files inside must be valid parquet files!
-
-```sql
-CREATE EXTERNAL TABLE taxi
-STORED AS PARQUET
-LOCATION '/mnt/nyctaxi/';
-```
-
-Register a single folder parquet datasource by specifying a wildcard for files to read
-
-```sql
-CREATE EXTERNAL TABLE taxi
-STORED AS PARQUET
-LOCATION '/mnt/nyctaxi/*.parquet';
-```
-
-## Registering CSV Data Sources
-
-CSV data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
-
-```sql
-CREATE EXTERNAL TABLE test
-STORED AS CSV
-WITH HEADER ROW
-LOCATION '/path/to/aggregate_test_100.csv';
-```
-
-It is also possible to provide schema information.
-
-```sql
-CREATE EXTERNAL TABLE test (
- c1 VARCHAR NOT NULL,
- c2 INT NOT NULL,
- c3 SMALLINT NOT NULL,
- c4 SMALLINT NOT NULL,
- c5 INT NOT NULL,
- c6 BIGINT NOT NULL,
- c7 SMALLINT NOT NULL,
- c8 INT NOT NULL,
- c9 BIGINT NOT NULL,
- c10 VARCHAR NOT NULL,
- c11 FLOAT NOT NULL,
- c12 DOUBLE NOT NULL,
- c13 VARCHAR NOT NULL
-)
-STORED AS CSV
-LOCATION '/path/to/aggregate_test_100.csv';
-```
-
-## Registering Remote Data Sources
-
-`datafusion-cli` can read from remote locations using a variety of protocols.
-For example to read from a remote parquet file via HTTP(S) you can use the following:
-
-```sql
-CREATE EXTERNAL TABLE hits
-STORED AS PARQUET
-LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
-```
-
-```sql
-select count(*) from hits;
-+----------+
-| COUNT(*) |
-+----------+
-| 1000000 |
-+----------+
-1 row in set. Query took 0.344 seconds.
-```
-
-## Registering S3 Data Sources
-
-[AWS S3](https://aws.amazon.com/s3/) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
-
-```sql
-CREATE EXTERNAL TABLE test
-STORED AS PARQUET
-OPTIONS(
- 'aws.access_key_id' '******',
- 'aws.secret_access_key' '******',
- 'aws.region' 'us-east-2'
-)
-LOCATION 's3://bucket/path/file.parquet';
-```
-
-The supported OPTIONS are:
-
-- access_key_id
-- secret_access_key
-- session_token
-- region
-
-It is also possible to simplify sql statements by environment variables.
-
-```bash
-$ export AWS_DEFAULT_REGION=us-east-2
-$ export AWS_SECRET_ACCESS_KEY=******
-$ export AWS_ACCESS_KEY_ID=******
-
-$ datafusion-cli
-DataFusion CLI v21.0.0
-> create external table test stored as parquet location 's3://bucket/path/file.parquet';
-0 rows in set. Query took 0.374 seconds.
-> select * from test;
-+----------+----------+
-| column_1 | column_2 |
-+----------+----------+
-| 1 | 2 |
-+----------+----------+
-1 row in set. Query took 0.171 seconds.
-```
-
-Details of the environment variables that can be used are:
-
-- AWS_ACCESS_KEY_ID -> access_key_id
-- AWS_SECRET_ACCESS_KEY -> secret_access_key
-- AWS_DEFAULT_REGION -> region
-- AWS_ENDPOINT -> endpoint
-- AWS_SESSION_TOKEN -> token
-- AWS_CONTAINER_CREDENTIALS_RELATIVE_URI -> <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html>
-- AWS_ALLOW_HTTP -> set to "true" to permit HTTP connections without TLS
-- AWS_PROFILE -> Support for using a [named profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) to supply credentials
-
-## Registering OSS Data Sources
-
-[Alibaba cloud OSS](https://www.alibabacloud.com/product/object-storage-service) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
-
-```sql
-CREATE EXTERNAL TABLE test
-STORED AS PARQUET
-OPTIONS(
- 'aws.access_key_id' '******',
- 'aws.secret_access_key' '******',
- 'aws.oss.endpoint' 'https://bucket.oss-cn-hangzhou.aliyuncs.com'
-)
-LOCATION 'oss://bucket/path/file.parquet';
-```
-
-The supported OPTIONS are:
-
-- access_key_id
-- secret_access_key
-- endpoint
-
-Note that the `endpoint` format of oss needs to be: `https://{bucket}.{oss-region-endpoint}`
-
-## Registering COS Data Sources
-
-[Tencent cloud COS](https://cloud.tencent.com/product/cos) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
-
-```sql
-CREATE EXTERNAL TABLE test
-STORED AS PARQUET
-OPTIONS(
- 'aws.access_key_id' '******',
- 'aws.secret_access_key' '******',
- 'aws.cos.endpoint' 'https://cos.ap-singapore.myqcloud.com'
-)
-LOCATION 'cos://bucket/path/file.parquet';
-```
-
-The supported OPTIONS are:
-
-- access_key_id
-- secret_access_key
-- endpoint
-
-Note that the `endpoint` format of urls must be: `https://cos.{cos-region-endpoint}`
-
-## Registering GCS Data Sources
-
-[Google Cloud Storage](https://cloud.google.com/storage) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
-
-```sql
-CREATE EXTERNAL TABLE test
-STORED AS PARQUET
-OPTIONS(
- 'gcp.service_account_path' '/tmp/gcs.json',
-)
-LOCATION 'gs://bucket/path/file.parquet';
-```
-
-The supported OPTIONS are:
-
-- service_account_path -> location of service account file
-- service_account_key -> JSON serialized service account key
-- application_credentials_path -> location of application credentials file
-
-It is also possible to simplify sql statements by environment variables.
-
-```bash
-$ export GOOGLE_SERVICE_ACCOUNT=/tmp/gcs.json
-
-$ datafusion-cli
-DataFusion CLI v21.0.0
-> create external table test stored as parquet location 'gs://bucket/path/file.parquet';
-0 rows in set. Query took 0.374 seconds.
-> select * from test;
-+----------+----------+
-| column_1 | column_2 |
-+----------+----------+
-| 1 | 2 |
-+----------+----------+
-1 row in set. Query took 0.171 seconds.
-```
-
-Details of the environment variables that can be used are:
-
-- GOOGLE_SERVICE_ACCOUNT: location of service account file
-- GOOGLE_SERVICE_ACCOUNT_PATH: (alias) location of service account file
-- SERVICE_ACCOUNT: (alias) location of service account file
-- GOOGLE_SERVICE_ACCOUNT_KEY: JSON serialized service account key
-- GOOGLE_BUCKET: bucket name
-- GOOGLE_BUCKET_NAME: (alias) bucket name
-
-## Commands
-
-Available commands inside DataFusion CLI are:
-
-- Quit
-
-```bash
-> \q
-```
-
-- Help
-
-```bash
-> \?
-```
-
-- ListTables
-
-```bash
-> \d
-```
-
-- DescribeTable
-
-```bash
-> \d table_name
-```
-
-- QuietMode
-
-```bash
-> \quiet [true|false]
-```
-
-- list function
-
-```bash
-> \h
-```
-
-- Search and describe function
-
-```bash
-> \h function
-```
-
-## Supported SQL
-
-In addition to the normal [SQL supported in DataFusion], `datafusion-cli` also
-supports additional statements and commands:
-
-[sql supported in datafusion]: sql/index.rst
-
-- Show configuration options
-
-`SHOW ALL [VERBOSE]`
-
-```SQL
-> show all;
-
-+-------------------------------------------------+---------+
-| name | value |
-+-------------------------------------------------+---------+
-| datafusion.execution.batch_size | 8192 |
-| datafusion.execution.coalesce_batches | true |
-| datafusion.execution.time_zone | UTC |
-| datafusion.explain.logical_plan_only | false |
-| datafusion.explain.physical_plan_only | false |
-| datafusion.optimizer.filter_null_join_keys | false |
-| datafusion.optimizer.skip_failed_rules | true |
-+-------------------------------------------------+---------+
-
-```
-
-- Show specific configuration option
-
-`SHOW xyz.abc.qwe [VERBOSE]`
-
-```SQL
-> show datafusion.execution.batch_size;
-
-+-------------------------------------------------+---------+
-| name | value |
-+-------------------------------------------------+---------+
-| datafusion.execution.batch_size | 8192 |
-+-------------------------------------------------+---------+
-
-```
-
-- Set configuration options
-
-```SQL
-> SET datafusion.execution.batch_size to 1024;
-```
-
-- `parquet_metadata` table function
-
-The `parquet_metadata` table function can be used to inspect detailed metadata
-about a parquet file such as statistics, sizes, and other information. This can
-be helpful to understand how parquet files are structured.
-
-For example, to see information about the `"WatchID"` column in the
-`hits.parquet` file, you can use:
-
-```sql
-SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
-FROM parquet_metadata('hits.parquet')
-WHERE path_in_schema = '"WatchID"'
-LIMIT 3;
-
-+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
-| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
-+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
-| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
-| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
-| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
-+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
-3 rows in set. Query took 0.053 seconds.
-```
-
-The returned table has the following columns for each row for each column chunk
-in the file. Please refer to the [Parquet Documentation] for more information.
-
-[parquet documentation]: https://parquet.apache.org/
-
-| column_name             | data_type | Description |
-| ----------------------- | --------- | ----------- |
-| filename                | Utf8      | Name of the file |
-| row_group_id            | Int64     | Row group index the column chunk belongs to |
-| row_group_num_rows      | Int64     | Count of rows stored in the row group |
-| row_group_num_columns   | Int64     | Total number of columns in the row group (same for all row groups) |
-| row_group_bytes         | Int64     | Number of bytes used to store the row group (not including metadata) |
-| column_id               | Int64     | ID of the column |
-| file_offset             | Int64     | Offset within the file that this column chunk's data begins |
-| num_values              | Int64     | Total number of values in this column chunk |
-| path_in_schema          | Utf8      | "Path" (column name) of the column chunk in the schema |
-| type                    | Utf8      | Parquet data type of the column chunk |
-| stats_min               | Utf8      | The minimum value for this column chunk, if stored in the statistics, cast to a string |
-| stats_max               | Utf8      | The maximum value for this column chunk, if stored in the statistics, cast to a string |
-| stats_null_count        | Int64     | Number of null values in this column chunk, if stored in the statistics |
-| stats_distinct_count    | Int64     | Number of distinct values in this column chunk, if stored in the statistics |
-| stats_min_value         | Utf8      | Same as `stats_min` |
-| stats_max_value         | Utf8      | Same as `stats_max` |
-| compression             | Utf8      | Block level compression (e.g. `SNAPPY`) used for this column chunk |
-| encodings               | Utf8      | All block level encodings (e.g. `[PLAIN_DICTIONARY, PLAIN, RLE]`) used for this column chunk |
-| index_page_offset       | Int64     | Offset in the file of the [`page index`], if any |
-| dictionary_page_offset  | Int64     | Offset in the file of the dictionary page, if any |
-| data_page_offset        | Int64     | Offset in the file of the first data page, if any |
-| total_compressed_size   | Int64     | Number of bytes the column chunk's data after encoding and compression (what is stored in the file) |
-| total_uncompressed_size | Int64     | Number of bytes the column chunk's data after encoding |
-
-+-------------------------+-----------+-------------+
-
-[`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
-
-## Changing Configuration Options
-
-All available configuration options can be seen using `SHOW ALL` as described above.
-
-You can change the configuration options using environment
-variables. `datafusion-cli` looks in the corresponding environment
-variable with an upper case name and all `.` converted to `_`.
-
-For example, to set `datafusion.execution.batch_size` to `1024` you
-would set the `DATAFUSION_EXECUTION_BATCH_SIZE` environment variable
-appropriately:
-
-```shell
-$ DATAFUSION_EXECUTION_BATCH_SIZE=1024 datafusion-cli
-DataFusion CLI v12.0.0
-> show all;
-+-------------------------------------------------+---------+
-| name | value |
-+-------------------------------------------------+---------+
-| datafusion.execution.batch_size | 1024 |
-| datafusion.execution.coalesce_batches | true |
-| datafusion.execution.time_zone | UTC |
-| datafusion.explain.logical_plan_only | false |
-| datafusion.explain.physical_plan_only | false |
-| datafusion.optimizer.filter_null_join_keys | false |
-| datafusion.optimizer.skip_failed_rules | true |
-+-------------------------------------------------+---------+
-8 rows in set. Query took 0.002 seconds.
-```
-
-You can change the configuration options using `SET` statement as well
-
-```shell
-$ datafusion-cli
-DataFusion CLI v13.0.0
-> show datafusion.execution.batch_size;
-+---------------------------------+---------+
-| name | value |
-+---------------------------------+---------+
-| datafusion.execution.batch_size | 8192 |
-+---------------------------------+---------+
-1 row in set. Query took 0.011 seconds.
-
-> set datafusion.execution.batch_size to 1024;
-0 rows in set. Query took 0.000 seconds.
-
-> show datafusion.execution.batch_size;
-+---------------------------------+---------+
-| name | value |
-+---------------------------------+---------+
-| datafusion.execution.batch_size | 1024 |
-+---------------------------------+---------+
-1 row in set. Query took 0.005 seconds.
-```
diff --git a/docs/source/user-guide/cli/datasources.md b/docs/source/user-guide/cli/datasources.md
new file mode 100644
index 0000000000..c2c00b6334
--- /dev/null
+++ b/docs/source/user-guide/cli/datasources.md
@@ -0,0 +1,349 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Local Files / Directories
+
+Files can be queried directly by enclosing the file, directory name
+or a remote location in single `'` quotes as shown in the examples.
+
+Create a CSV file to query.
+
+```shell
+$ echo "a,b" > data.csv
+$ echo "1,2" >> data.csv
+```
+
+Query that single file (the CLI also supports parquet, compressed csv, avro, json and more)
+
+```shell
+$ datafusion-cli
+DataFusion CLI v17.0.0
+> select * from 'data.csv';
++---+---+
+| a | b |
++---+---+
+| 1 | 2 |
++---+---+
+1 row in set. Query took 0.007 seconds.
+```
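+
+The same direct query syntax applies to the other supported formats. For
+example, a newline-delimited JSON file (created here purely for illustration;
+the output shown is a sketch):
+
+```shell
+$ echo '{"a": 1, "b": 2}' > data.json
+$ datafusion-cli
+DataFusion CLI v17.0.0
+> select * from 'data.json';
++---+---+
+| a | b |
++---+---+
+| 1 | 2 |
++---+---+
+1 row in set. Query took 0.004 seconds.
+```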
+
+You can also query directories of files with compatible schemas:
+
+```shell
+$ ls data_dir/
+data.csv data2.csv
+```
+
+```shell
+$ datafusion-cli
+DataFusion CLI v16.0.0
+> select * from 'data_dir';
++---+---+
+| a | b |
++---+---+
+| 3 | 4 |
+| 1 | 2 |
++---+---+
+2 rows in set. Query took 0.007 seconds.
+```
+
+# Remote Files / Directories
+
+You can also directly query any remote location supported by DataFusion without
+registering the location as a table.
+For example, to read from a remote parquet file via HTTP(S) you can use the following:
+
+```sql
+select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'
++----------+
+| COUNT(*) |
++----------+
+| 1000000 |
++----------+
+1 row in set. Query took 0.595 seconds.
+```
+
+To read from AWS S3 or GCS, use `s3` or `gs` as a protocol prefix. For
+example, to read a file in an S3 bucket named `my-data-bucket`, use the URL
+`s3://my-data-bucket` and set the relevant access credentials as environment
+variables (e.g. for AWS S3 you need to set at least `AWS_ACCESS_KEY_ID` and
+`AWS_SECRET_ACCESS_KEY`).
+
+```sql
+select count(*) from 's3://my-data-bucket/athena_partitioned/hits.parquet'
+```
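+
+A GCS location works the same way with the `gs` prefix (a sketch; the bucket
+below is hypothetical and the appropriate GCS credentials, described in the
+[GCS](#gcs) section, must be configured):
+
+```sql
+select count(*) from 'gs://my-data-bucket/athena_partitioned/hits.parquet'
+```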
+
+See the [`CREATE EXTERNAL TABLE`](#create-external-table) section for
+additional configuration options.
+
+# `CREATE EXTERNAL TABLE`
+
+It is also possible to create a table backed by files or remote locations via
+`CREATE EXTERNAL TABLE` as shown below. Note that wildcards (e.g. `*`) are also
+supported.
+
+For example, to create a table `hits` backed by a local parquet file, use:
+
+```sql
+CREATE EXTERNAL TABLE hits
+STORED AS PARQUET
+LOCATION 'hits.parquet';
+```
+
+To create a table `hits` backed by a remote parquet file via HTTP(S), use
+
+```sql
+CREATE EXTERNAL TABLE hits
+STORED AS PARQUET
+LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+```
+
+In both cases, `hits` can now be queried as a regular table:
+
+```sql
+select count(*) from hits;
++----------+
+| COUNT(*) |
++----------+
+| 1000000 |
++----------+
+1 row in set. Query took 0.344 seconds.
+```
+
+# Formats
+
+## Parquet
+
+The schema information for parquet will be derived automatically.
+
+Register a single file parquet datasource
+
+```sql
+CREATE EXTERNAL TABLE taxi
+STORED AS PARQUET
+LOCATION '/mnt/nyctaxi/tripdata.parquet';
+```
+
+Register a single folder parquet datasource. Note: All files inside must be valid
+parquet files and have compatible schemas.
+
+```sql
+CREATE EXTERNAL TABLE taxi
+STORED AS PARQUET
+LOCATION '/mnt/nyctaxi/';
+```
+
+Register a single folder parquet datasource by specifying a wildcard for files to read.
+
+```sql
+CREATE EXTERNAL TABLE taxi
+STORED AS PARQUET
+LOCATION '/mnt/nyctaxi/*.parquet';
+```
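+
+However it was registered, the table can then be queried like any other table
+(a sketch; it assumes valid parquet files actually exist at the location used
+above):
+
+```sql
+SELECT count(*) FROM taxi;
+```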
+
+## CSV
+
+DataFusion will infer the CSV schema automatically, or you can provide it explicitly.
+
+Register a single file csv datasource with a header row.
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS CSV
+WITH HEADER ROW
+LOCATION '/path/to/aggregate_test_100.csv';
+```
+
+Register a single file csv datasource with explicitly defined schema.
+
+```sql
+CREATE EXTERNAL TABLE test (
+ c1 VARCHAR NOT NULL,
+ c2 INT NOT NULL,
+ c3 SMALLINT NOT NULL,
+ c4 SMALLINT NOT NULL,
+ c5 INT NOT NULL,
+ c6 BIGINT NOT NULL,
+ c7 SMALLINT NOT NULL,
+ c8 INT NOT NULL,
+ c9 BIGINT NOT NULL,
+ c10 VARCHAR NOT NULL,
+ c11 FLOAT NOT NULL,
+ c12 DOUBLE NOT NULL,
+ c13 VARCHAR NOT NULL
+)
+STORED AS CSV
+LOCATION '/path/to/aggregate_test_100.csv';
+```
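+
+In either case, the registered table can then be queried with regular SQL (a
+sketch; it assumes `aggregate_test_100.csv` exists at the path used above):
+
+```sql
+SELECT c1, count(*) FROM test GROUP BY c1;
+```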
+
+# Locations
+
+## HTTP(S)
+
+To read from a remote parquet file via HTTP(S) you can use the following:
+
+```sql
+CREATE EXTERNAL TABLE hits
+STORED AS PARQUET
+LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+```
+
+## S3
+
+[AWS S3](https://aws.amazon.com/s3/) data sources must have connection credentials configured.
+
+To create an external table from a file in an S3 bucket:
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+ 'aws.access_key_id' '******',
+ 'aws.secret_access_key' '******',
+ 'aws.region' 'us-east-2'
+)
+LOCATION 's3://bucket/path/file.parquet';
+```
+
+It is also possible to specify the access information using environment variables:
+
+```bash
+$ export AWS_DEFAULT_REGION=us-east-2
+$ export AWS_SECRET_ACCESS_KEY=******
+$ export AWS_ACCESS_KEY_ID=******
+
+$ datafusion-cli
+DataFusion CLI v21.0.0
+> create external table test stored as parquet location 's3://bucket/path/file.parquet';
+0 rows in set. Query took 0.374 seconds.
+> select * from test;
++----------+----------+
+| column_1 | column_2 |
++----------+----------+
+| 1 | 2 |
++----------+----------+
+1 row in set. Query took 0.171 seconds.
+```
+
+Supported configuration options are:
+
+| Environment Variable                     | Configuration Option    | Description |
+| ---------------------------------------- | ----------------------- | ----------- |
+| `AWS_ACCESS_KEY_ID`                      | `aws.access_key_id`     |             |
+| `AWS_SECRET_ACCESS_KEY`                  | `aws.secret_access_key` |             |
+| `AWS_DEFAULT_REGION`                     | `aws.region`            |             |
+| `AWS_ENDPOINT`                           | `aws.endpoint`          |             |
+| `AWS_SESSION_TOKEN`                      | `aws.token`             |             |
+| `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` |                         | See [IAM Roles] |
+| `AWS_ALLOW_HTTP`                         |                         | set to "true" to permit HTTP connections without TLS |
+| `AWS_PROFILE`                            |                         | Support for using a [named profile] to supply credentials |
+
+[named profile]: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
+[iam roles]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html
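+
+Because `aws.endpoint` (or `AWS_ENDPOINT`) can be overridden, the same syntax
+also works against S3-compatible stores. A minimal sketch, assuming a
+hypothetical local MinIO server on port 9000 with default credentials:
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+    'aws.access_key_id' 'minioadmin',
+    'aws.secret_access_key' 'minioadmin',
+    'aws.endpoint' 'http://localhost:9000'
+)
+LOCATION 's3://bucket/path/file.parquet';
+```
+
+Since such an endpoint uses plain HTTP, `AWS_ALLOW_HTTP=true` must also be set.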
+
+## OSS
+
+[Alibaba cloud OSS](https://www.alibabacloud.com/product/object-storage-service) data sources must have connection credentials configured.
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+ 'aws.access_key_id' '******',
+ 'aws.secret_access_key' '******',
+ 'aws.oss.endpoint' 'https://bucket.oss-cn-hangzhou.aliyuncs.com'
+)
+LOCATION 'oss://bucket/path/file.parquet';
+```
+
+The supported OPTIONS are:
+
+- access_key_id
+- secret_access_key
+- endpoint
+
+Note that the `endpoint` format of oss needs to be: `https://{bucket}.{oss-region-endpoint}`
+
+## COS
+
+[Tencent cloud COS](https://cloud.tencent.com/product/cos) data sources must have connection credentials configured.
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+ 'aws.access_key_id' '******',
+ 'aws.secret_access_key' '******',
+ 'aws.cos.endpoint' 'https://cos.ap-singapore.myqcloud.com'
+)
+LOCATION 'cos://bucket/path/file.parquet';
+```
+
+The supported OPTIONS are:
+
+- access_key_id
+- secret_access_key
+- endpoint
+
+Note that the `endpoint` format of urls must be: `https://cos.{cos-region-endpoint}`
+
+## GCS
+
+[Google Cloud Storage](https://cloud.google.com/storage) data sources must have connection credentials configured.
+
+For example, to create an external table from a file in a GCS bucket:
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+ 'gcp.service_account_path' '/tmp/gcs.json',
+)
+LOCATION 'gs://bucket/path/file.parquet';
+```
+
+It is also possible to specify the access information using environment variables:
+
+```bash
+$ export GOOGLE_SERVICE_ACCOUNT=/tmp/gcs.json
+
+$ datafusion-cli
+DataFusion CLI v21.0.0
+> create external table test stored as parquet location 'gs://bucket/path/file.parquet';
+0 rows in set. Query took 0.374 seconds.
+> select * from test;
++----------+----------+
+| column_1 | column_2 |
++----------+----------+
+| 1 | 2 |
++----------+----------+
+1 row in set. Query took 0.171 seconds.
+```
+
+Supported configuration options are:
+
+| Environment Variable             | Configuration Option               | Description |
+| -------------------------------- | ---------------------------------- | ----------- |
+| `GOOGLE_SERVICE_ACCOUNT`         | `gcp.service_account_path`         | location of service account file |
+| `GOOGLE_SERVICE_ACCOUNT_PATH`    | `gcp.service_account_path`         | (alias) location of service account file |
+| `SERVICE_ACCOUNT`                | `gcp.service_account_path`         | (alias) location of service account file |
+| `GOOGLE_SERVICE_ACCOUNT_KEY`     | `gcp.service_account_key`          | JSON serialized service account key |
+| `GOOGLE_APPLICATION_CREDENTIALS` | `gcp.application_credentials_path` | location of application credentials file |
+| `GOOGLE_BUCKET`                  |                                    | bucket name |
+| `GOOGLE_BUCKET_NAME`             |                                    | (alias) bucket name |
diff --git a/docs/source/user-guide/cli/index.rst b/docs/source/user-guide/cli/index.rst
new file mode 100644
index 0000000000..874cfc0eae
--- /dev/null
+++ b/docs/source/user-guide/cli/index.rst
@@ -0,0 +1,27 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+DataFusion CLI
+==============
+
+.. toctree::
+ :maxdepth: 3
+
+ overview
+ installation
+ usage
+ datasources
diff --git a/docs/source/user-guide/cli/installation.md b/docs/source/user-guide/cli/installation.md
new file mode 100644
index 0000000000..f7fcd29072
--- /dev/null
+++ b/docs/source/user-guide/cli/installation.md
@@ -0,0 +1,63 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Installation
+
+## Install and run using Cargo
+
+To build and install the latest release of `datafusion-cli` from source, do:
+
+```shell
+cargo install datafusion-cli
+# Updating crates.io index
+# Installing datafusion-cli v37.0.0
+# Updating crates.io index
+# ...
+```
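+
+You can then verify the installation (illustrative output; the version number
+will match whichever release was installed):
+
+```shell
+datafusion-cli --version
+# datafusion-cli 37.0.0
+```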
+
+## Install and run using Homebrew (on MacOS)
+
+`datafusion-cli` can also be installed via [Homebrew] (on MacOS) like this:
+
+[homebrew]: https://docs.brew.sh/Installation
+
+```bash
+brew install datafusion
+# ...
+# ==> Pouring datafusion--37.0.0.arm64_sonoma.bottle.tar.gz
+# 🍺 /opt/homebrew/Cellar/datafusion/37.0.0: 9 files, 63.0MB
+# ==> Running `brew cleanup datafusion`...
+```
+
+## Run using Docker
+
+There is no officially published Docker image for the DataFusion CLI, so it is necessary to build from source
+instead.
+
+Use the following commands to clone this repository and build a Docker image containing the CLI tool. Note
+that there is a `.dockerignore` file in the root of the repository that may need to be deleted in order for
+this to work.
+
+```bash
+git clone https://github.com/apache/arrow-datafusion
+cd arrow-datafusion
+git checkout 12.0.0
+docker build -f datafusion-cli/Dockerfile . --tag datafusion-cli
+docker run -it -v $(your_data_location):/data datafusion-cli
+```
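+
+Once the image is built, mount a local directory into the container and query
+files from it (a sketch; `/path/to/your/data` is a placeholder for your own
+directory):
+
+```bash
+docker run -it -v /path/to/your/data:/data datafusion-cli
+# DataFusion CLI v12.0.0
+# > select * from '/data/data.csv';
+```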
diff --git a/docs/source/user-guide/cli/overview.md b/docs/source/user-guide/cli/overview.md
new file mode 100644
index 0000000000..86beea0e82
--- /dev/null
+++ b/docs/source/user-guide/cli/overview.md
@@ -0,0 +1,45 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Overview
+
+DataFusion CLI (`datafusion-cli`) is an interactive command-line utility for executing
+SQL queries against any supported data files.
+
+While intended as an example of how to use DataFusion, `datafusion-cli` offers a
+full range of SQL features and supports reading and writing CSV, Parquet, JSON, Arrow and
+Avro, from local files, directories, or remote locations such as S3.
+
+Here is an example of how to run a SQL query against a local file, `hits.parquet`:
+
+```shell
+$ datafusion-cli
+DataFusion CLI v37.0.0
+> select count(distinct "URL") from 'hits.parquet';
++----------------------------------+
+| COUNT(DISTINCT hits.parquet.URL) |
++----------------------------------+
+| 18342019 |
++----------------------------------+
+1 row(s) fetched.
+Elapsed 1.969 seconds.
+```
+
+For more information, see the [Installation](installation), [Usage Guide](usage)
+and [Data Sources](datasources) sections.
diff --git a/docs/source/user-guide/cli/usage.md b/docs/source/user-guide/cli/usage.md
new file mode 100644
index 0000000000..617b462875
--- /dev/null
+++ b/docs/source/user-guide/cli/usage.md
@@ -0,0 +1,288 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Usage
+
+See the current usage using `datafusion-cli --help`:
+
+```bash
+Apache Arrow <[email protected]>
+Command Line Client for DataFusion query engine.
+
+USAGE:
+ datafusion-cli [OPTIONS]
+
+OPTIONS:
+ -b, --batch-size <BATCH_SIZE>
+ The batch size of each query, or use DataFusion default
+
+ -c, --command <COMMAND>...
+ Execute the given command string(s), then exit
+
+ --color
+ Enables console syntax highlighting
+
+ -f, --file <FILE>...
+ Execute commands from file(s), then exit
+
+ --format <FORMAT>
+ [default: table] [possible values: csv, tsv, table, json, nd-json]
+
+ -h, --help
+ Print help information
+
+ -m, --memory-limit <MEMORY_LIMIT>
+ The memory pool limitation (e.g. '10g'), default to None (no limit)
+
+ --maxrows <MAXROWS>
+ The max number of rows to display for 'Table' format
+ [default: 40] [possible values: numbers(0/10/...), inf(no limit)]
+
+ --mem-pool-type <MEM_POOL_TYPE>
+ Specify the memory pool type 'greedy' or 'fair', default to 'greedy'
+
+ -p, --data-path <DATA_PATH>
+ Path to your data, default to current directory
+
+ -q, --quiet
+ Reduce printing other than the results and work quietly
+
+ -r, --rc <RC>...
+ Run the provided files on startup instead of ~/.datafusionrc
+
+ -V, --version
+ Print version information
+```
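+
+For example, the `-c` flag listed above can be combined with `--format` and
+`-q` to run a query non-interactively (a sketch; the exact output may vary by
+version):
+
+```bash
+$ datafusion-cli -q --format csv -c 'SELECT 1 + 2 AS sum;'
+# sum
+# 3
+```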
+
+## Commands
+
+Available commands inside DataFusion CLI are:
+
+- Quit
+
+```bash
+> \q
+```
+
+- Help
+
+```bash
+> \?
+```
+
+- ListTables
+
+```bash
+> \d
+```
+
+- DescribeTable
+
+```bash
+> \d table_name
+```
+
+- QuietMode
+
+```bash
+> \quiet [true|false]
+```
+
+- list function
+
+```bash
+> \h
+```
+
+- Search and describe function
+
+```bash
+> \h function
+```
+
+## Supported SQL
+
+In addition to the normal [SQL supported in DataFusion], `datafusion-cli` also
+supports additional statements and commands:
+
+[sql supported in datafusion]: ../sql/index.rst
+
+### `SHOW ALL [VERBOSE]`
+
+Show configuration options
+
+```SQL
+> show all;
+
++-------------------------------------------------+---------+
+| name | value |
++-------------------------------------------------+---------+
+| datafusion.execution.batch_size | 8192 |
+| datafusion.execution.coalesce_batches | true |
+| datafusion.execution.time_zone | UTC |
+| datafusion.explain.logical_plan_only | false |
+| datafusion.explain.physical_plan_only | false |
+| datafusion.optimizer.filter_null_join_keys | false |
+| datafusion.optimizer.skip_failed_rules | true |
++-------------------------------------------------+---------+
+
+```
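+
+The `VERBOSE` variant additionally prints details, such as a description, for
+each option (a sketch; the full output is long):
+
+```SQL
+> show all verbose;
+```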
+
+### `SHOW <OPTION>`
+
+Show specific configuration option
+
+```SQL
+> show datafusion.execution.batch_size;
+
++-------------------------------------------------+---------+
+| name | value |
++-------------------------------------------------+---------+
+| datafusion.execution.batch_size | 8192 |
++-------------------------------------------------+---------+
+
+```
+
+### `SET <OPTION> TO <VALUE>`
+
+Set configuration options
+
+```SQL
+> SET datafusion.execution.batch_size to 1024;
+```
+
+## Configuration Options
+
+All available configuration options can be seen using `SHOW ALL` as described above.
+
+You can change the configuration options using environment
+variables. `datafusion-cli` looks in the corresponding environment
+variable with an upper case name and all `.` converted to `_`.
+
+For example, to set `datafusion.execution.batch_size` to `1024` you
+would set the `DATAFUSION_EXECUTION_BATCH_SIZE` environment variable
+appropriately:
+
+```shell
+$ DATAFUSION_EXECUTION_BATCH_SIZE=1024 datafusion-cli
+DataFusion CLI v12.0.0
+> show all;
++-------------------------------------------------+---------+
+| name | value |
++-------------------------------------------------+---------+
+| datafusion.execution.batch_size | 1024 |
+| datafusion.execution.coalesce_batches | true |
+| datafusion.execution.time_zone | UTC |
+| datafusion.explain.logical_plan_only | false |
+| datafusion.explain.physical_plan_only | false |
+| datafusion.optimizer.filter_null_join_keys | false |
+| datafusion.optimizer.skip_failed_rules | true |
++-------------------------------------------------+---------+
+8 rows in set. Query took 0.002 seconds.
+```
+
+You can change the configuration options using the `SET` statement as well
+
+```shell
+$ datafusion-cli
+DataFusion CLI v13.0.0
+> show datafusion.execution.batch_size;
++---------------------------------+---------+
+| name | value |
++---------------------------------+---------+
+| datafusion.execution.batch_size | 8192 |
++---------------------------------+---------+
+1 row in set. Query took 0.011 seconds.
+
+> set datafusion.execution.batch_size to 1024;
+0 rows in set. Query took 0.000 seconds.
+
+> show datafusion.execution.batch_size;
++---------------------------------+---------+
+| name | value |
++---------------------------------+---------+
+| datafusion.execution.batch_size | 1024 |
++---------------------------------+---------+
+1 row in set. Query took 0.005 seconds.
+```
+
+## Functions
+
+`datafusion-cli` comes with built-in functions that are not included in the
+DataFusion SQL engine. These functions are:
+
+### `parquet_metadata`
+
+The `parquet_metadata` table function can be used to inspect detailed metadata
+about a parquet file such as statistics, sizes, and other information. This can
+be helpful to understand how parquet files are structured.
+
+For example, to see information about the `"WatchID"` column in the
+`hits.parquet` file, you can use:
+
+```sql
+SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
+FROM parquet_metadata('hits.parquet')
+WHERE path_in_schema = '"WatchID"'
+LIMIT 3;
+
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
+| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
+| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+3 rows in set. Query took 0.053 seconds.
+```
+
+The returned table has one row for each column chunk in the file, with the
+following columns. Please refer to the [Parquet Documentation] for more information.
+
+[parquet documentation]: https://parquet.apache.org/
+
+| column_name             | data_type | Description |
+| ----------------------- | --------- | ----------- |
+| filename                | Utf8      | Name of the file |
+| row_group_id            | Int64     | Row group index the column chunk belongs to |
+| row_group_num_rows      | Int64     | Count of rows stored in the row group |
+| row_group_num_columns   | Int64     | Total number of columns in the row group (same for all row groups) |
+| row_group_bytes         | Int64     | Number of bytes used to store the row group (not including metadata) |
+| column_id               | Int64     | ID of the column |
+| file_offset             | Int64     | Offset within the file that this column chunk's data begins |
+| num_values              | Int64     | Total number of values in this column chunk |
+| path_in_schema          | Utf8      | "Path" (column name) of the column chunk in the schema |
+| type                    | Utf8      | Parquet data type of the column chunk |
+| stats_min               | Utf8      | The minimum value for this column chunk, if stored in the statistics, cast to a string |
+| stats_max               | Utf8      | The maximum value for this column chunk, if stored in the statistics, cast to a string |
+| stats_null_count        | Int64     | Number of null values in this column chunk, if stored in the statistics |
+| stats_distinct_count    | Int64     | Number of distinct values in this column chunk, if stored in the statistics |
+| stats_min_value         | Utf8      | Same as `stats_min` |
+| stats_max_value         | Utf8      | Same as `stats_max` |
+| compression             | Utf8      | Block level compression (e.g. `SNAPPY`) used for this column chunk |
+| encodings               | Utf8      | All block level encodings (e.g. `[PLAIN_DICTIONARY, PLAIN, RLE]`) used for this column chunk |
+| index_page_offset       | Int64     | Offset in the file of the [`page index`], if any |
+| dictionary_page_offset  | Int64     | Offset in the file of the dictionary page, if any |
+| data_page_offset        | Int64     | Offset in the file of the first data page, if any |
+| total_compressed_size   | Int64     | Number of bytes the column chunk's data after encoding and compression (what is stored in the file) |
+| total_uncompressed_size | Int64     | Number of bytes the column chunk's data after encoding |
+
+[`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
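+
+Because the metadata includes both compressed and uncompressed sizes, it can be
+used for quick ad-hoc analysis. A sketch (assuming the same `hits.parquet` file
+as above) that estimates the compression ratio per column:
+
+```sql
+SELECT path_in_schema,
+       sum(total_compressed_size) AS compressed,
+       sum(total_uncompressed_size) AS uncompressed,
+       CAST(sum(total_uncompressed_size) AS DOUBLE) / sum(total_compressed_size) AS ratio
+FROM parquet_metadata('hits.parquet')
+GROUP BY path_in_schema
+ORDER BY ratio DESC
+LIMIT 5;
+```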