This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new 561e0d7e87 Documentation Updates for New Write Related Features (#7520)
561e0d7e87 is described below
commit 561e0d7e87825aba224bf2eb9c3b8b5e1b725597
Author: Devin D'Angelo <[email protected]>
AuthorDate: Tue Sep 12 09:57:30 2023 -0400
Documentation Updates for New Write Related Features (#7520)
* initial update
* add external table options
* add parquet docs
* md to html in links
* prettier
* edit SINGLE_FILE_OUPUT description
* Apply suggestions from code review
Co-authored-by: Alex Huang <[email protected]>
* prettier
* Apply suggestions from code review
Co-authored-by: Andrew Lamb <[email protected]>
* add link
---------
Co-authored-by: Alex Huang <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
---
docs/source/user-guide/sql/ddl.md | 4 +-
docs/source/user-guide/sql/dml.md | 12 +--
docs/source/user-guide/sql/index.rst | 1 +
docs/source/user-guide/sql/write_options.md | 131 ++++++++++++++++++++++++++++
4 files changed, 138 insertions(+), 10 deletions(-)
diff --git a/docs/source/user-guide/sql/ddl.md
b/docs/source/user-guide/sql/ddl.md
index 54a9af6c23..b67d323126 100644
--- a/docs/source/user-guide/sql/ddl.md
+++ b/docs/source/user-guide/sql/ddl.md
@@ -64,7 +64,7 @@ STORED AS <file_type>
[ DELIMITER <char> ]
[ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ]
[ PARTITIONED BY (<column list>) ]
-[ WITH ORDER (<ordered column list>)
+[ WITH ORDER (<ordered column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>
@@ -77,6 +77,8 @@ LOCATION <literal>
<key_value_list> := (<literal> <literal, <literal> <literal>, ...)
```
+For a detailed list of write related options which can be passed in the
OPTIONS key_value_list, see [Write Options](write_options).
+
`file_type` is one of `CSV`, `ARROW`, `PARQUET`, `AVRO` or `JSON`
`LOCATION <literal>` specifies the location to find the data. It can be
diff --git a/docs/source/user-guide/sql/dml.md
b/docs/source/user-guide/sql/dml.md
index 9794fba4aa..c3226936e7 100644
--- a/docs/source/user-guide/sql/dml.md
+++ b/docs/source/user-guide/sql/dml.md
@@ -28,18 +28,12 @@ Copies the contents of a table or query to file(s).
Supported file
formats are `parquet`, `csv`, and `json` and can be inferred based on
filename if writing to a single file.
-The `PER_THREAD_OUTPUT` option treats `file_name` as a directory and writes a
file per thread within it.
-
<pre>
COPY { <i><b>table_name</i></b> | <i><b>query</i></b> } TO
'<i><b>file_name</i></b>' [ ( <i><b>option</i></b> [, ... ] ) ]
-
-where <i><b>option</i></b> can be one of:
- FORMAT <i><b>format_name</i></b>
- PER_THREAD_OUTPUT <i><b>boolean</i></b>
- ROW_GROUP_SIZE <i><b>integer</i></b>
- ROW_GROUP_LIMIT_BYTES <i><b>integer</i></b>
</pre>
+For a detailed list of valid OPTIONS, see [Write Options](write_options).
+
Copy the contents of `source_table` to `file_name.json` in JSON format:
```sql
@@ -55,7 +49,7 @@ Copy the contents of `source_table` to one or more Parquet
formatted
files in the `dir_name` directory:
```sql
-> COPY source_table TO 'dir_name' (FORMAT parquet, PER_THREAD_OUTPUT true);
+> COPY source_table TO 'dir_name' (FORMAT parquet, SINGLE_FILE_OUTPUT false);
+-------+
| count |
+-------+
diff --git a/docs/source/user-guide/sql/index.rst
b/docs/source/user-guide/sql/index.rst
index b282fbddd6..04d1fc228f 100644
--- a/docs/source/user-guide/sql/index.rst
+++ b/docs/source/user-guide/sql/index.rst
@@ -33,3 +33,4 @@ SQL Reference
window_functions
scalar_functions
sql_status
+ write_options
diff --git a/docs/source/user-guide/sql/write_options.md
b/docs/source/user-guide/sql/write_options.md
new file mode 100644
index 0000000000..c98a39f24b
--- /dev/null
+++ b/docs/source/user-guide/sql/write_options.md
@@ -0,0 +1,131 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Write Options
+
+DataFusion supports customizing how data is written out to disk as a result of
a `COPY` or `INSERT INTO` query. There are a few special options, file format
(e.g. CSV or parquet) specific options, and parquet column specific options.
Options can also in some cases be specified in multiple ways with a set order
of precedence.
+
+## Specifying Options and Order of Precedence
+
+Write related options can be specified in the following ways:
+
+- Session level config defaults
+- `CREATE EXTERNAL TABLE` options
+- `COPY` option tuples
+
+For a list of supported session level config defaults see [Configuration
Settings](configs). These defaults apply to all write operations but have the
lowest level of precedence.
+
+If inserting to an external table, table specific write options can be
specified when the table is created using the `OPTIONS` clause:
+
+```sql
+CREATE EXTERNAL TABLE
+my_table(a bigint, b bigint)
+STORED AS csv
+COMPRESSION TYPE gzip
+WITH HEADER ROW
+DELIMITER ';'
+LOCATION '/test/location/my_csv_table/'
+OPTIONS(
+CREATE_LOCAL_PATH 'true',
+NULL_VALUE 'NAN'
+);
+```
+
+When running `INSERT INTO my_table ...`, the options from the `CREATE TABLE`
will be respected (gzip compression, special delimiter, and header row
included). Note that compression, header, and delimiter settings can also be
specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL
statement always takes precedence over arbitrary option tuples, so if both are
specified the `OPTIONS` setting will be ignored. CREATE_LOCAL_PATH is a special
option that indicates if DataFusi [...]
+
+Finally, options can be passed when running a `COPY` command.
+
+```sql
+COPY source_table
+TO 'test/table_with_options'
+(format parquet,
+single_file_output false,
+compression snappy,
+'compression::col1' 'zstd(5)',
+)
+```
+
+In this example, we write the entirety of `source_table` out to a folder of
parquet files. The option `single_file_output` set to false, indicates that the
destination path should be interpreted as a folder to which the query will
output multiple files. One parquet file will be written in parallel to the
folder for each partition in the query. The next option `compression` set to
`snappy` indicates that unless otherwise specified all columns should use the
snappy compression codec. The o [...]
+
+## Available Options
+
+### COPY Specific Options
+
+The following special options are specific to the `COPY` command.
+
+| Option | Description
| Default Value |
+| ------------------ |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| ------------- |
+| SINGLE_FILE_OUTPUT | If true, COPY query will write output to a single file.
Otherwise, multiple files will be written to a directory in parallel.
| true |
+| FORMAT | Specifies the file format COPY query will write out. If
single_file_output is false or the format cannot be inferred from the file
extension, then FORMAT must be specified. | N/A |
+
+### CREATE EXTERNAL TABLE Specific Options
+
+The following special options are specific to creating an external table.
+
+| Option | Description
| Default Value
|
+| ----------------- |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| ----------------------------------------------------------------------------
|
+| SINGLE_FILE | If true, indicates that this external table is backed by
a single file. INSERT INTO queries will append to this file.
| false
|
+| CREATE_LOCAL_PATH | If true, the folder or file backing this table will be
created on the local file system if it does not already exist when running
INSERT INTO queries.
| false
|
+| INSERT_MODE | Determines if INSERT INTO queries should append to
existing files or append new files to an existing directory. Valid values are
append_to_file, append_new_files, and error. Note that "error" will block
inserting data into this table. | CSV and JSON default to append_to_file.
Parquet defaults to append_new_files |
+
+### JSON Format Specific Options
+
+The following options are available when writing JSON files. Note: If any
unsupported option is specified, an error will be raised and the query will
fail.
+
+| Option | Description
| Default
Value |
+| ----------- |
----------------------------------------------------------------------------------------------------------------------------------
| ------------- |
+| COMPRESSION | Sets the compression that should be applied to the entire JSON
file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. |
UNCOMPRESSED |
+
+### CSV Format Sepcific Options
+
+The following options are available when writing CSV files. Note: if any
unsupported options is specified an error will be raised and the query will
fail.
+
+| Option | Description
|
Default Value |
+| --------------- |
---------------------------------------------------------------------------------------------------------------------------------
| ---------------- |
+| COMPRESSION | Sets the compression that should be applied to the entire
CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. |
UNCOMPRESSED |
+| HEADER | Sets if the CSV file should include column headers
| false
|
+| DATE_FORMAT | Sets the format that dates should be encoded in within the
CSV file |
arrow-rs default |
+| DATETIME_FORMAT | Sets the format that datetimes should be encoded in within
the CSV file |
arrow-rs default |
+| TIME_FORMAT | Sets the format that times should be encoded in within the
CSV file |
arrow-rs default |
+| RFC3339 | If true, uses RFC339 format for date and time encodings
|
arrow-rs default |
+| NULL_VALUE | Sets the string which should be used to indicate null
values within the CSV file. |
arrow-rs default |
+| DELIMITER | Sets the character which should be used as the column
delimiter within the CSV file. |
arrow-rs default |
+
+### Parquet Format Specific Options
+
+The following options are available when writing parquet files. If any
unsupported option is specified an error will be raised and the query will
fail. If a column specific option is specified for a column which does not
exist, the option will be ignored without error. For default values, see:
[Configuration
Settings](https://arrow.apache.org/datafusion/user-guide/configs.html).
+
+| Option | Can be Column Specific? | Description
|
+| ---------------------------- | ----------------------- |
-------------------------------------------------------------------------------------------------------------
|
+| COMPRESSION | Yes | Sets the
compression codec and if applicable compression level to use
|
+| MAX_ROW_GROUP_SIZE | No | Sets the maximum
number of rows that can be encoded in a single row group
|
+| DATA_PAGESIZE_LIMIT | No | Sets the best
effort maximum page size in bytes
|
+| WRITE_BATCH_SIZE | No | Maximum number of
rows written for each column in a single batch
|
+| WRITER_VERSION | No | Parquet writer
version (1.0 or 2.0)
|
+| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort
maximum dictionary page size in bytes
|
+| CREATED_BY | No | Sets the "created
by" property in the parquet file
|
+| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the max length
of min/max value fields in the column index.
|
+| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort
maximum number of rows in a data page.
|
+| BLOOM_FILTER_ENABLED | Yes | Sets whether a
bloom filter should be written into the file.
|
+| ENCODING | Yes | Sets the encoding
that should be used (e.g. PLAIN or RLE)
|
+| DICTIONARY_ENABLED | Yes | Sets if dictionary
encoding is enabled. Use this instead of ENCODING to set dictionary encoding.
|
+| STATISTICS_ENABLED | Yes | Sets if statistics
are enabled at PAGE or ROW_GROUP level.
|
+| MAX_STATISTICS_SIZE | Yes | Sets the maximum
size in bytes that statistics can take up.
|
+| BLOOM_FILTER_FPP | Yes | Sets the false
positive probability (fpp) for the bloom filter. Implicitly sets
BLOOM_FILTER_ENABLED to true. |
+| BLOOM_FILTER_NDV | Yes | Sets the number of
distinct values (ndv) for the bloom filter. Implicitly sets
bloom_filter_enabled to true. |