alamb commented on code in PR #7520: URL: https://github.com/apache/arrow-datafusion/pull/7520#discussion_r1321877939
########## docs/source/user-guide/sql/ddl.md: ########## @@ -77,6 +77,8 @@ LOCATION <literal> <key_value_list> := (<literal> <literal, <literal> <literal>, ...) ``` +For a detailed list of write related options which can be passed in the OPTIONS key_value_list, see [Write Options](https://arrow.apache.org/datafusion/user-guide/sql/write_options.html). Review Comment: I think we can make a local link like this: ```suggestion For a detailed list of write related options which can be passed in the OPTIONS key_value_list, see [Write Options](write_options). ``` ########## docs/source/user-guide/sql/dml.md: ########## @@ -28,18 +28,12 @@ Copies the contents of a table or query to file(s). Supported file formats are `parquet`, `csv`, and `json` and can be inferred based on filename if writing to a single file. -The `PER_THREAD_OUTPUT` option treats `file_name` as a directory and writes a file per thread within it. - <pre> COPY { <i><b>table_name</i></b> | <i><b>query</i></b> } TO '<i><b>file_name</i></b>' [ ( <i><b>option</i></b> [, ... ] ) ] - -where <i><b>option</i></b> can be one of: - FORMAT <i><b>format_name</i></b> - PER_THREAD_OUTPUT <i><b>boolean</i></b> - ROW_GROUP_SIZE <i><b>integer</i></b> - ROW_GROUP_LIMIT_BYTES <i><b>integer</i></b> </pre> +For a detailed list of valid OPTIONS, see [Write Options](https://arrow.apache.org/datafusion/user-guide/sql/write_options.html). Review Comment: ```suggestion For a detailed list of valid OPTIONS, see [Write Options](write_options). ``` ########## docs/source/user-guide/sql/write_options.md: ########## @@ -0,0 +1,131 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# Write Options + +DataFusion supports customizing how data is written out to disk as a result of a `COPY` or `INSERT INTO` query. There are a few special options, file format (e.g. CSV or parquet) specific options, and parquet column specific options. Options can also in some cases be specified in multiple ways with a set order of precedence. + +## Specifying Options and Order of Precedence + +Write related options can be specified in the following ways: + +- Session level config defaults +- `CREATE EXTERNAL TABLE` options +- `COPY` option tuples + +For a list of supported session level config defaults see [Configuration Settings](https://arrow.apache.org/datafusion/user-guide/configs.html). These defaults apply to all write operations but have the lowest level of precedence. + +If inserting to an external table, table specific write options can be specified when the table is created: Review Comment: ```suggestion If inserting to an external table, table specific write options can be specified when the table is created using the `OPTIONS` clause: ``` ########## docs/source/user-guide/sql/write_options.md: ########## @@ -0,0 +1,131 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# Write Options + +DataFusion supports customizing how data is written out to disk as a result of a `COPY` or `INSERT INTO` query. There are a few special options, file format (e.g. CSV or parquet) specific options, and parquet column specific options. Options can also in some cases be specified in multiple ways with a set order of precedence. + +## Specifying Options and Order of Precedence + +Write related options can be specified in the following ways: + +- Session level config defaults +- `CREATE EXTERNAL TABLE` options +- `COPY` option tuples + +For a list of supported session level config defaults see [Configuration Settings](https://arrow.apache.org/datafusion/user-guide/configs.html). These defaults apply to all write operations but have the lowest level of precedence. + +If inserting to an external table, table specific write options can be specified when the table is created: + +```sql +CREATE EXTERNAL TABLE +my_table(a bigint, b bigint) +STORED AS csv +COMPRESSION TYPE gzip +WITH HEADER ROW +DELIMITER ';' +LOCATION '/test/location/my_csv_table/' +OPTIONS( +CREATE_LOCAL_PATH 'true', +NULL_VALUE 'NAN' +); +``` + +When running `INSERT INTO my_table ...`, the above specified options will be respected (gzip compression, special delimiter, and header row included). Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the `OPTIONS` setting will be ignored. CREATE_LOCAL_PATH is a special option that indicates if DataFusion should create local file paths when writing new files if they do not already exist. This option is useful if you wish to create an external table from scratch, using only DataFusion SQL statements. Finally, NULL_VALUE is a CSV format specific option that determines how null values should be encoded within the CSV file. + +Finally, options can be passed when running a `COPY` command. + +```sql +COPY source_table +TO 'test/table_with_options' +(format parquet, +single_file_output false, +compression snappy, +'compression::col1' 'zstd(5)', +) +``` + +In this example, we write the entirety of `source_table` out to a folder of parquet files. The option `single_file_output` set to false, indicates that the destination path should be interpreted as a folder to which the query will output multiple files. One parquet file will be written in parallel to the folder for each partition in the query. The next option `compression` set to `snappy` indicates that unless otherwise specified all columns should use the snappy compression codec. The option `compression::col1` sets an override, so that the column `col1` in the parquet file will use `ZSTD` compression codec with compression level `5`. In general, parquet option which support column specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`. + +## Available Options + +### COPY Specific Options + +The following special options are specific to the `COPY` query. Review Comment: ```suggestion The following special options are specific to the `COPY` command. ``` ########## docs/source/user-guide/sql/dml.md: ########## @@ -55,7 +49,7 @@ Copy the contents of `source_table` to one or more Parquet formatted files in the `dir_name` directory: ```sql -> COPY source_table TO 'dir_name' (FORMAT parquet, PER_THREAD_OUTPUT true); Review Comment: 👍 ########## docs/source/user-guide/sql/write_options.md: ########## @@ -0,0 +1,131 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# Write Options + +DataFusion supports customizing how data is written out to disk as a result of a `COPY` or `INSERT INTO` query. There are a few special options, file format (e.g. CSV or parquet) specific options, and parquet column specific options. Options can also in some cases be specified in multiple ways with a set order of precedence. + +## Specifying Options and Order of Precedence + +Write related options can be specified in the following ways: + +- Session level config defaults +- `CREATE EXTERNAL TABLE` options +- `COPY` option tuples + +For a list of supported session level config defaults see [Configuration Settings](https://arrow.apache.org/datafusion/user-guide/configs.html). These defaults apply to all write operations but have the lowest level of precedence. Review Comment: ```suggestion For a list of supported session level config defaults see [Configuration Settings](configs). These defaults apply to all write operations but have the lowest level of precedence. ``` ########## docs/source/user-guide/sql/write_options.md: ########## @@ -0,0 +1,131 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# Write Options + +DataFusion supports customizing how data is written out to disk as a result of a `COPY` or `INSERT INTO` query. There are a few special options, file format (e.g. CSV or parquet) specific options, and parquet column specific options. Options can also in some cases be specified in multiple ways with a set order of precedence. + +## Specifying Options and Order of Precedence + +Write related options can be specified in the following ways: + +- Session level config defaults +- `CREATE EXTERNAL TABLE` options +- `COPY` option tuples + +For a list of supported session level config defaults see [Configuration Settings](https://arrow.apache.org/datafusion/user-guide/configs.html). These defaults apply to all write operations but have the lowest level of precedence. + +If inserting to an external table, table specific write options can be specified when the table is created: + +```sql +CREATE EXTERNAL TABLE +my_table(a bigint, b bigint) +STORED AS csv +COMPRESSION TYPE gzip +WITH HEADER ROW +DELIMITER ';' +LOCATION '/test/location/my_csv_table/' +OPTIONS( +CREATE_LOCAL_PATH 'true', +NULL_VALUE 'NAN' +); +``` + +When running `INSERT INTO my_table ...`, the above specified options will be respected (gzip compression, special delimiter, and header row included). Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the `OPTIONS` setting will be ignored. CREATE_LOCAL_PATH is a special option that indicates if DataFusion should create local file paths when writing new files if they do not already exist. This option is useful if you wish to create an external table from scratch, using only DataFusion SQL statements. Finally, NULL_VALUE is a CSV format specific option that determines how null values should be encoded within the CSV file. Review Comment: ```suggestion When running `INSERT INTO my_table ...`, the options from the `CREATE TABLE` will be respected (gzip compression, special delimiter, and header row included). Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the `OPTIONS` setting will be ignored. CREATE_LOCAL_PATH is a special option that indicates if DataFusion should create local file paths when writing new files if they do not already exist. This option is useful if you wish to create an external table from scratch, using only DataFusion SQL statements. Finally, NULL_VALUE is a CSV format specific option that determines how null values should be encoded within the CSV file. ``` ########## docs/source/user-guide/sql/dml.md: ########## @@ -55,7 +49,7 @@ Copy the contents of `source_table` to one or more Parquet formatted files in the `dir_name` directory: ```sql -> COPY source_table TO 'dir_name' (FORMAT parquet, PER_THREAD_OUTPUT true); Review Comment: 👍 ########## docs/source/user-guide/sql/write_options.md: ########## @@ -0,0 +1,131 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# Write Options Review Comment: I think it might be a good idea to add a link to this page into the index https://github.com/apache/arrow-datafusion/blob/main/docs/source/user-guide/sql/index.rst so it show up in the left hand nav bar -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
