[ 
https://issues.apache.org/jira/browse/FLINK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu updated FLINK-39065:
------------------------
    Description: 
h1. Motivation

Currently, the Flink CSV format only exposes a limited set of Jackson CsvParser 
configurations (field-delimiter, quote-character, allow-comments, etc.). 
However, the underlying Jackson CsvParser.Feature enum provides several useful 
parsing features that are commonly needed in production environments to handle 
various types of "dirty" CSV data:

1. *TRIM_SPACES* - Trims leading/trailing whitespace from field values. This is 
one of the most common requirements when dealing with CSV files generated by 
various tools that may pad fields with spaces.

2. *IGNORE_TRAILING_UNMAPPABLE* - Ignores extra fields at the end of a row that 
exceed the schema definition. This is useful when CSV files have additional 
columns appended over time.

3. *ALLOW_TRAILING_COMMA* - Allows a trailing comma at the end of a row without 
treating it as an additional empty field. Some CSV export tools append a 
trailing comma to every row.

4. *FAIL_ON_MISSING_COLUMNS* - Throws an error when a row has fewer columns 
than the schema expects. This is complementary to `ignore-parse-errors` and 
provides more precise error detection for data quality monitoring.

5. *EMPTY_STRING_AS_NULL* - Treats empty string values ("") as null. This is 
very common in ETL scenarios where empty fields should be treated as null 
values.

All of these features have been supported natively by Jackson's CsvParser.Feature since version 2.9. The Jackson bundled in flink-shaded (2.18.2) already includes all of them, so this improvement simply exposes them as Flink SQL table format options.
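For illustration, here is a minimal standalone sketch (independent of Flink, using jackson-dataformat-csv directly) showing two of these features in action; the class and feature names are Jackson's own, but the demo class itself is hypothetical:

```java
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvParser;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

import java.util.Map;

public class CsvFeatureDemo {
    public static void main(String[] args) throws Exception {
        CsvMapper mapper = new CsvMapper();
        // Two of the features this issue proposes to expose:
        mapper.enable(CsvParser.Feature.TRIM_SPACES);
        mapper.enable(CsvParser.Feature.EMPTY_STRING_AS_NULL);

        CsvSchema schema = CsvSchema.builder()
                .addColumn("name")
                .addColumn("city")
                .build();

        // With TRIM_SPACES, "  Alice  " is read as "Alice"; with
        // EMPTY_STRING_AS_NULL, the empty city field should be bound
        // as null rather than "".
        Map<String, String> row = mapper
                .readerFor(Map.class)
                .with(schema)
                .readValue("  Alice  ,\n");
        System.out.println(row);
    }
}
```

The proposed Flink options would toggle exactly these CsvParser.Feature flags on the parser built inside the deserialization schema.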
h1. Changes
h2. New Configuration Options (all deserialization-only)
||Option Key||Type||Default||Description||
|csv.trim-spaces|Boolean|false|Trim leading/trailing spaces from unquoted field values|
|csv.ignore-trailing-unmappable|Boolean|false|Ignore extra trailing fields that cannot be mapped|
|csv.allow-trailing-comma|Boolean|false|Allow a trailing comma after the last field value|
|csv.fail-on-missing-columns|Boolean|false|Fail when a row has fewer columns than expected|
|csv.empty-string-as-null|Boolean|false|Treat empty string values as null|
h2. Modified Files
 - CsvFormatOptions: Add 5 new ConfigOption definitions
 - CsvCommons: Register new options in optionalOptions() and forwardOptions()
 - CsvRowDataDeserializationSchema: Add CsvParser.Feature configuration support 
in Builder and open()
 - CsvFileFormatFactory: Add CsvParser.Feature configuration support for batch 
file reading
 - CsvFormatFactory: Wire new options to deserialization schema builder
 - CsvFormatFactoryTest: Add tests for each new option
 - docs: Update CSV format documentation
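As a sketch of the intended shape of the CsvFormatOptions additions (two of the five shown; keys are written without the "csv." prefix, which the format factory prepends, and the description strings are placeholders, not final wording):

```java
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Hypothetical sketch of the new definitions in CsvFormatOptions.
public static final ConfigOption<Boolean> TRIM_SPACES =
        ConfigOptions.key("trim-spaces")
                .booleanType()
                .defaultValue(false)
                .withDescription(
                        "Optional flag to trim leading/trailing spaces from "
                                + "unquoted field values; disabled by default.");

public static final ConfigOption<Boolean> EMPTY_STRING_AS_NULL =
        ConfigOptions.key("empty-string-as-null")
                .booleanType()
                .defaultValue(false)
                .withDescription(
                        "Optional flag to treat empty string values as null; "
                                + "disabled by default.");
```

Each option would then be registered in CsvCommons.optionalOptions()/forwardOptions() and mapped to the corresponding CsvParser.Feature in the deserialization schema builder.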

h1. Compatibility
 - Fully backward compatible: all new options are optional, with safe defaults that preserve existing behavior
 - No changes to serialization path
 - No API breaking changes (only additions to @Internal classes and 
@PublicEvolving config options)



> Support additional CsvParser.Feature options for CSV format deserialization
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-39065
>                 URL: https://issues.apache.org/jira/browse/FLINK-39065
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>            Reporter: Liu
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
