[ 
https://issues.apache.org/jira/browse/FLINK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39065:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support additional CsvParser.Feature options for CSV format deserialization
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-39065
>                 URL: https://issues.apache.org/jira/browse/FLINK-39065
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>            Reporter: Liu
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Motivation
> Currently, the Flink CSV format only exposes a limited set of Jackson 
> CsvParser configurations (field-delimiter, quote-character, allow-comments, 
> etc.). However, the underlying Jackson CsvParser.Feature enum provides 
> several useful parsing features that are commonly needed in production 
> environments to handle various types of "dirty" CSV data:
> 1. *TRIM_SPACES* - Trims leading/trailing whitespace from field values. This 
> is one of the most common requirements when dealing with CSV files generated 
> by various tools that may pad fields with spaces.
> 2. *IGNORE_TRAILING_UNMAPPABLE* - Ignores extra fields at the end of a row 
> that exceed the schema definition. This is useful when CSV files have 
> additional columns appended over time.
> 3. *ALLOW_TRAILING_COMMA* - Allows a trailing comma at the end of a row 
> without treating it as an additional empty field. Some CSV export tools 
> append a trailing comma to every row.
> 4. *FAIL_ON_MISSING_COLUMNS* - Throws an error when a row has fewer columns 
> than the schema expects. This is complementary to `ignore-parse-errors` and 
> provides more precise error detection for data quality monitoring.
> 5. *EMPTY_STRING_AS_NULL* - Treats empty string values ("") as null. This is 
> very common in ETL scenarios where empty fields should be treated as null 
> values.
> All these features are natively supported by Jackson's CsvParser.Feature 
> since version 2.9+. The Flink shaded Jackson (2.18.2) already includes all of 
> them. This improvement simply exposes these features as Flink SQL table 
> format options.
> h1. Changes
> h2. New Configuration Options (all deserialization-only)
> |Option Key|Type|Default|Description|
> |csv.trim-spaces|Boolean|false|Trim leading/trailing spaces from unquoted field values|
> |csv.ignore-trailing-unmappable|Boolean|false|Ignore extra trailing fields that cannot be mapped|
> |csv.allow-trailing-comma|Boolean|false|Allow trailing comma after the last field value|
> |csv.fail-on-missing-columns|Boolean|false|Fail when a row has fewer columns than expected|
> |csv.empty-string-as-null|Boolean|false|Treat empty string values as null|
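>
> As a rough sketch of how the proposed options might be set in a table 
> definition: the table name, columns, and path below are hypothetical, and the 
> five csv.* keys are the ones proposed in this issue, so they are assumed to be 
> unavailable until the change is merged.
>
> ```sql
> -- Hypothetical table reading "dirty" CSV files from a local directory.
> CREATE TABLE orders_csv (
>   order_id BIGINT,
>   customer STRING,
>   amount   DOUBLE
> ) WITH (
>   'connector' = 'filesystem',
>   'path' = 'file:///tmp/orders',              -- hypothetical path
>   'format' = 'csv',
>   -- Options proposed in FLINK-39065 (deserialization only):
>   'csv.trim-spaces' = 'true',                 -- trim padding around unquoted values
>   'csv.ignore-trailing-unmappable' = 'true',  -- skip extra trailing columns
>   'csv.allow-trailing-comma' = 'true',        -- tolerate a trailing comma per row
>   'csv.fail-on-missing-columns' = 'false',    -- do not fail on short rows
>   'csv.empty-string-as-null' = 'true'         -- read empty strings as NULL
> );
> ```
>
> Since every option defaults to false, leaving them out of the WITH clause 
> keeps the current parsing behavior.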
> h2. Modified Files
>  - CsvFormatOptions: Add 5 new ConfigOption definitions
>  - CsvCommons: Register new options in optionalOptions() and forwardOptions()
>  - CsvRowDataDeserializationSchema: Add CsvParser.Feature configuration 
> support in Builder and open()
>  - CsvFileFormatFactory: Add CsvParser.Feature configuration support for 
> batch file reading
>  - CsvFormatFactory: Wire new options to deserialization schema builder
>  - CsvFormatFactoryTest: Add tests for each new option
>  - docs: Update CSV format documentation
> h1. Compatibility
>  - Fully backward compatible: all new options are optional with safe default 
> values that preserve existing behavior
>  - No changes to serialization path
>  - No API breaking changes (only additions to @Internal classes and 
> @PublicEvolving config options)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
