alamb commented on code in PR #15708:
URL: https://github.com/apache/datafusion/pull/15708#discussion_r2049366426


##########
docs/source/user-guide/sql/format_options.md:
##########
@@ -0,0 +1,209 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Format Options
+
+DataFusion supports customizing how data is read from or written to disk as a result of `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` statements. There are a few special options, file-format-specific options (e.g., for CSV or Parquet), and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.
+
+## Specifying Options and Order of Precedence
+
+Format-related options can be specified in three ways, in decreasing order of precedence:
+
+- `CREATE EXTERNAL TABLE` syntax
+- `COPY` option tuples
+- Session-level config defaults
+
+For a list of supported session-level config defaults, see [Configuration Settings](../configs). These defaults apply to all operations but have the lowest level of precedence.
+
+If creating an external table, table-specific format options can be specified when the table is created using the `OPTIONS` clause:
+
+```sql
+CREATE EXTERNAL TABLE
+  my_table(a bigint, b bigint)
+  STORED AS csv
+  LOCATION '/test/location/my_csv_table/'
+  OPTIONS(
+    NULL_VALUE 'NAN',
+    'has_header' 'true',
+    'format.delimiter' ';'
+  );
+```
+
+When running `INSERT INTO my_table ...`, the options from the `CREATE EXTERNAL TABLE` statement will be respected (e.g., gzip compression, special delimiter, and header row included). Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified, the `OPTIONS` setting will be ignored.
+
+Finally, options can be passed when running a `COPY` command.
+
+```sql
+COPY source_table
+  TO 'test/table_with_options'
+  PARTITIONED BY (column3, column4)
+  OPTIONS (
+    format parquet,
+    compression snappy,
+    'compression::column1' 'zstd(5)'
+  );
+```
+
+In this example, we write the entirety of `source_table` out to a folder of Parquet files. One Parquet file will be written in parallel to the folder for each partition in the query. The option `compression` set to `snappy` indicates that, unless otherwise specified, all columns should use the snappy compression codec. The option `compression::column1` sets an override, so that the column `column1` in the Parquet file will use the ZSTD compression codec with compression level `5`. In general, Parquet options that support column-specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`.
+
+# Available Options
+
+## Execution-Specific Options
+
+The following options are available when executing a `COPY` query.
+
+| Option Key | Description | Default Value |

Review Comment:
   This largely replicates what is already in this page: https://datafusion.apache.org/user-guide/configs.html
   
   I think the idea is that the format_options page will describe options specific to formats -- I will push a small commit to fix



##########
docs/source/user-guide/sql/format_options.md:
##########
@@ -0,0 +1,209 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Format Options
+
+DataFusion supports customizing how data is read from or written to disk as a result of `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` statements. There are a few special options, file-format-specific options (e.g., for CSV or Parquet), and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.
+
+## Specifying Options and Order of Precedence
+
+Format-related options can be specified in three ways, in decreasing order of precedence:
+
+- `CREATE EXTERNAL TABLE` syntax
+- `COPY` option tuples
+- Session-level config defaults
+
+For a list of supported session-level config defaults, see [Configuration Settings](../configs). These defaults apply to all operations but have the lowest level of precedence.
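+
+For example, a session-level default can be set with `SET` and then overridden by a `COPY` option tuple for a single statement. This is a sketch only: the config key and option spellings below are assumptions drawn from the [Configuration Settings](../configs) list and should be verified against your DataFusion version.
+
+```sql
+-- Session-level default (lowest precedence)
+SET datafusion.execution.parquet.compression TO 'zstd(3)';
+
+-- The COPY option tuple overrides the session default for this statement only
+COPY source_table TO 'out/data.parquet'
+  OPTIONS ('format.compression' 'snappy');
+```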
+
+If creating an external table, table-specific format options can be specified when the table is created using the `OPTIONS` clause:
+
+```sql
+CREATE EXTERNAL TABLE
+  my_table(a bigint, b bigint)
+  STORED AS csv
+  LOCATION '/test/location/my_csv_table/'
+  OPTIONS(
+    NULL_VALUE 'NAN',
+    'has_header' 'true',
+    'format.delimiter' ';'
+  );
+```
+
+When running `INSERT INTO my_table ...`, the options from the `CREATE EXTERNAL TABLE` statement will be respected (e.g., gzip compression, special delimiter, and header row included). Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified, the `OPTIONS` setting will be ignored.
+
+Finally, options can be passed when running a `COPY` command.
+
+```sql
+COPY source_table
+  TO 'test/table_with_options'
+  PARTITIONED BY (column3, column4)
+  OPTIONS (
+    format parquet,
+    compression snappy,
+    'compression::column1' 'zstd(5)'
+  );
+```
+
+In this example, we write the entirety of `source_table` out to a folder of Parquet files. One Parquet file will be written in parallel to the folder for each partition in the query. The option `compression` set to `snappy` indicates that, unless otherwise specified, all columns should use the snappy compression codec. The option `compression::column1` sets an override, so that the column `column1` in the Parquet file will use the ZSTD compression codec with compression level `5`. In general, Parquet options that support column-specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`.
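+
+Building on the `OPTION::COLUMN.NESTED.PATH` syntax, a column-specific override for a field nested inside a struct might look like the following sketch (the table and column names here are hypothetical):
+
+```sql
+COPY nested_table
+  TO 'test/nested_table'
+  OPTIONS (
+    format parquet,
+    -- override compression only for the 'location' field inside the 'info' struct
+    'compression::info.location' 'zstd(5)'
+  );
+```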
+
+# Available Options
+
+## Execution-Specific Options
+
+The following options are available when executing a `COPY` query.
+
+| Option Key | Description | Default Value |
+| ---------- | ----------- | ------------- |
+| execution.batch_size | Default batch size while creating new batches. | 8192 |
+| execution.coalesce_batches | Whether to coalesce small batches into larger ones between operators. | true |
+| execution.collect_statistics | Should DataFusion collect statistics after listing files. | false |
+| execution.target_partitions | Number of partitions for query execution. Defaults to number of CPU cores. | (num_cpus) |
+| execution.time_zone | The default time zone (e.g., "+00:00"). | "+00:00" |
+| execution.planning_concurrency | Fan-out during initial physical planning. Defaults to number of CPU cores. | (num_cpus) |
+| execution.skip_physical_aggregate_schema_check | Skip verifying that the schema produced by planning the input of `Aggregate` matches the input plan. | false |
+| execution.sort_spill_reservation_bytes | Reserved memory for each spillable sort operation (bytes). | 10485760 |
+| execution.sort_in_place_threshold_bytes | Below what size should data be concatenated and sorted in a single RecordBatch (bytes). | 1048576 |
+| execution.meta_fetch_concurrency | Number of files to read in parallel when inferring schema and statistics. | 32 |
+| execution.minimum_parallel_output_files | Guarantees a minimum level of output files running in parallel. | 4 |
+| execution.soft_max_rows_per_output_file | Target number of rows in output files when writing multiple (soft max). | 50000000 |
+| execution.max_buffered_batches_per_output_file | Maximum number of RecordBatches buffered for each output file. | 2 |
+| execution.listing_table_ignore_subdirectory | Should subdirectories be ignored when scanning directories for data files. | true |
+| execution.enable_recursive_ctes | Should DataFusion support recursive CTEs. | true |
+| execution.split_file_groups_by_statistics | Attempt to eliminate sorts by packing & sorting files with non-overlapping statistics into same file groups. | false |
+| execution.keep_partition_by_columns | Should DataFusion keep the columns used for partition_by in the output RecordBatches. | false |
+| execution.skip_partial_aggregation_probe_ratio_threshold | Aggregation ratio threshold for skipping partial aggregation. | 0.8 |
+| execution.skip_partial_aggregation_probe_rows_threshold | Number of input rows partial aggregation partition should process before switching mode. | 100000 |
+| execution.use_row_number_estimates_to_optimize_partitioning | Use row number estimates to optimize partitioning. | false |
+| execution.enforce_batch_size_in_joins | Should DataFusion enforce batch size in joins. | false |
+
+**Usage Example:**
+
+```sql
+COPY my_table TO '/tmp/output.csv'
+OPTIONS (
+  'execution.batch_size' '4096',
+  'execution.target_partitions' '8',
+  'execution.keep_partition_by_columns' 'true'
+);
+```
+
+Note: the `execution.keep_partition_by_columns` flag can also be enabled through `ExecutionOptions` within `SessionConfig`.
+
+## JSON Format Options
+
+The following options are available when reading or writing JSON files. Note: if any unsupported option is specified, an error will be raised and the query will fail.
+
+| Option | Description | Default Value |
+| ------ | ----------- | ------------- |
+| COMPRESSION | Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED |
+
+**Example:**
+
+```sql
+CREATE EXTERNAL TABLE t(a int)
+STORED AS JSON
+LOCATION '/tmp/foo/'
+OPTIONS('COMPRESSION' 'gzip');
+-- Inserting a row creates a new file in /tmp/foo
+INSERT INTO t VALUES(1);
+```
+
+## CSV Format Options
+
+The following options are available when reading or writing CSV files. Note: if any unsupported option is specified, an error will be raised and the query will fail.
+
+| Option | Description | Default Value |
+| ------ | ----------- | ------------- |
+| COMPRESSION | Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED |
+| HAS_HEADER | Sets if the CSV file should include column headers. If not set, uses session or system default. | None |
+| DELIMITER | Sets the character which should be used as the column delimiter within the CSV file. | `,` (comma) |
+| QUOTE | Sets the character which should be used for quoting values within the CSV file. | `"` (double quote) |
+| TERMINATOR | Sets the character which should be used as the line terminator within the CSV file. | None |
+| ESCAPE | Sets the character which should be used for escaping special characters within the CSV file. | None |
+| DOUBLE_QUOTE | Sets if quotes within quoted fields should be escaped by doubling them (e.g., `"aaa""bbb"`). | None |
+| NEWLINES_IN_VALUES | Sets if newlines in quoted values are supported. If not set, uses session or system default. | None |
+| DATE_FORMAT | Sets the format that dates should be encoded in within the CSV file. | None |
+| DATETIME_FORMAT | Sets the format that datetimes should be encoded in within the CSV file. | None |
+| TIMESTAMP_FORMAT | Sets the format that timestamps should be encoded in within the CSV file. | None |
+| TIMESTAMP_TZ_FORMAT | Sets the format that timestamps with timezone should be encoded in within the CSV file. | None |
+| TIME_FORMAT | Sets the format that times should be encoded in within the CSV file. | None |
+| NULL_VALUE | Sets the string which should be used to indicate null values within the CSV file. | None |
+| NULL_REGEX | Sets the regex pattern to match null values when loading CSVs. | None |
+| SCHEMA_INFER_MAX_REC | Sets the maximum number of records to scan to infer the schema. | None |
+| COMMENT | Sets the character which should be used to indicate comment lines in the CSV file. | None |
+
+**Example:**
+
+```sql
+CREATE EXTERNAL TABLE t (col1 varchar, col2 int, col3 boolean)
+STORED AS CSV
+LOCATION '/tmp/foo/'
+OPTIONS('DELIMITER' '|', 'HAS_HEADER' 'true', 'NEWLINES_IN_VALUES' 'true');
+```
+
+## Parquet Format Options

Review Comment:
   👍  this looks great



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

