Re: [PR] Improve documentation for format `OPTIONS` clause [datafusion]

via GitHub Wed, 16 Apr 2025 13:17:22 -0700


alamb commented on code in PR #15708:
URL: https://github.com/apache/datafusion/pull/15708#discussion_r2047668929



##########
docs/source/user-guide/sql/write_options.md:
##########
@@ -1,127 +0,0 @@
-<!---
-  Licensed to the Apache Software Foundation (ASF) under one
-  or more contributor license agreements.  See the NOTICE file
-  distributed with this work for additional information
-  regarding copyright ownership.  The ASF licenses this file
-  to you under the Apache License, Version 2.0 (the
-  "License"); you may not use this file except in compliance
-  with the License.  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing,
-  software distributed under the License is distributed on an
-  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-  KIND, either express or implied.  See the License for the
-  specific language governing permissions and limitations
-  under the License.
--->
-
-# Write Options
-
-DataFusion supports customizing how data is written out to disk as a result of a `COPY` or `INSERT INTO` query. There are a few special options, file format (e.g. CSV or parquet) specific options, and parquet column specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.
-
-## Specifying Options and Order of Precedence
-
-Write related options can be specified in the following ways:
-
-- Session level config defaults
-- `CREATE EXTERNAL TABLE` options
-- `COPY` option tuples
-
-For a list of supported session level config defaults see [Configuration Settings](../configs). These defaults apply to all write operations but have the lowest level of precedence.
-
-If inserting to an external table, table specific write options can be specified when the table is created using the `OPTIONS` clause:
-
-```sql
-CREATE EXTERNAL TABLE
-  my_table(a bigint, b bigint)
-  STORED AS csv
-  COMPRESSION TYPE gzip
-  LOCATION '/test/location/my_csv_table/'
-  OPTIONS(
-    NULL_VALUE 'NAN',
-    'has_header' 'true',
-    'format.delimiter' ';'
-  )
-```
-
-When running `INSERT INTO my_table ...`, the options from the `CREATE EXTERNAL TABLE` statement will be respected (gzip compression, special delimiter, and header row included). There will be a single output file if the output path is not in folder format, i.e. does not end with a `/`. Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the `OPTIONS` setting will be ignored. `NULL_VALUE` is a CSV format specific option that determines how null values should be encoded within the CSV file.
-
-Finally, options can be passed when running a `COPY` command.
-
-<!--
- Test the following example with:
- CREATE TABLE source_table AS VALUES ('1','2','3','4');
--->
-
-```sql
-COPY source_table
-  TO 'test/table_with_options'
-  PARTITIONED BY (column3, column4)
-  OPTIONS (
-    format parquet,
-    compression snappy,
-    'compression::column1' 'zstd(5)',
-  )
-```
-
-In this example, we write the entirety of `source_table` out to a folder of parquet files. One parquet file will be written in parallel to the folder for each partition in the query. The option `compression`, set to `snappy`, indicates that unless otherwise specified all columns should use the snappy compression codec. The option `compression::column1` sets an override, so that the column `column1` in the parquet file will use the `ZSTD` compression codec with compression level `5`. In general, parquet options which support column specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`.
-
-## Available Options
-
-### Execution Specific Options
-
-The following options are available when executing a `COPY` query.
-
-| Option | Description | Default Value |
-| --- | --- | --- |
-| execution.keep_partition_by_columns | Flag to retain the columns in the output data when using `PARTITIONED BY` queries. | false |
-
-Note: the `execution.keep_partition_by_columns` flag can also be enabled through `ExecutionOptions` within `SessionConfig`.
-
-### JSON Format Specific Options
-
-The following options are available when writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail.
-
-| Option | Description | Default Value |
-| --- | --- | --- |
-| COMPRESSION | Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED |
-
-### CSV Format Specific Options
-
-The following options are available when writing CSV files. Note: If any unsupported option is specified, an error will be raised and the query will fail.
-
-| Option | Description | Default Value |
-| --- | --- | --- |
-| COMPRESSION | Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED |
-| HEADER | Sets if the CSV file should include column headers | false |
-| DATE_FORMAT | Sets the format that dates should be encoded in within the CSV file | arrow-rs default |
-| DATETIME_FORMAT | Sets the format that datetimes should be encoded in within the CSV file | arrow-rs default |
-| TIME_FORMAT | Sets the format that times should be encoded in within the CSV file | arrow-rs default |
-| RFC3339 | If true, uses RFC3339 format for date and time encodings | arrow-rs default |
-| NULL_VALUE | Sets the string which should be used to indicate null values within the CSV file. | arrow-rs default |
-| DELIMITER | Sets the character which should be used as the column delimiter within the CSV file. | arrow-rs default |
-
-### Parquet Format Specific Options
-
-The following options are available when writing parquet files. If any unsupported option is specified an error will be raised and the query will fail. If a column specific option is specified for a column which does not exist, the option will be ignored without error. For default values, see: [Configuration Settings](https://datafusion.apache.org/user-guide/configs.html).
-
-| Option | Can be Column Specific? | Description |
-| --- | --- | --- |
-| COMPRESSION | Yes | Sets the compression codec and if applicable compression level to use |
-| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows that can be encoded in a single row group. Larger row groups require more memory to write and read. |
-| DATA_PAGESIZE_LIMIT | No | Sets the best effort maximum page size in bytes |
-| WRITE_BATCH_SIZE | No | Maximum number of rows written for each column in a single batch |
-| WRITER_VERSION | No | Parquet writer version (1.0 or 2.0) |
-| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size in bytes |
-| CREATED_BY | No | Sets the "created by" property in the parquet file |
-| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the max length of min/max value fields in the column index. |
-| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in a data page. |

Review Comment:
   We seem to have lost some of these options in the new doc


