[
https://issues.apache.org/jira/browse/FLINK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-39065:
-----------------------------------
Labels: pull-request-available (was: )
> Support additional CsvParser.Feature options for CSV format deserialization
> ---------------------------------------------------------------------------
>
> Key: FLINK-39065
> URL: https://issues.apache.org/jira/browse/FLINK-39065
> Project: Flink
> Issue Type: Improvement
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Reporter: Liu
> Priority: Major
> Labels: pull-request-available
>
> h1. Motivation
> Currently, the Flink CSV format only exposes a limited set of Jackson
> CsvParser configurations (field-delimiter, quote-character, allow-comments,
> etc.). However, the underlying Jackson CsvParser.Feature enum provides
> several useful parsing features that are commonly needed in production
> environments to handle various types of "dirty" CSV data:
> 1. *TRIM_SPACES* - Trims leading/trailing whitespace from field values. This
> is one of the most common requirements when dealing with CSV files generated
> by various tools that may pad fields with spaces.
> 2. *IGNORE_TRAILING_UNMAPPABLE* - Ignores extra fields at the end of a row
> that exceed the schema definition. This is useful when CSV files have
> additional columns appended over time.
> 3. *ALLOW_TRAILING_COMMA* - Allows a trailing comma at the end of a row
> without treating it as an additional empty field. Some CSV export tools
> append a trailing comma to every row.
> 4. *FAIL_ON_MISSING_COLUMNS* - Throws an error when a row has fewer columns
> than the schema expects. This is complementary to `ignore-parse-errors` and
> provides more precise error detection for data quality monitoring.
> 5. *EMPTY_STRING_AS_NULL* - Treats empty string values ("") as null. This is
> very common in ETL scenarios where empty fields should be treated as null
> values.
> All of these features are natively supported by Jackson's CsvParser.Feature
> since version 2.9+, and the Flink shaded Jackson (2.18.2) already includes
> all of them. This improvement simply exposes them as Flink SQL table format
> options.
> h1. Changes
> h2. New Configuration Options (all deserialization-only)
> ||Option Key||Type||Default||Description||
> |csv.trim-spaces|Boolean|false|Trim leading/trailing spaces from unquoted field values|
> |csv.ignore-trailing-unmappable|Boolean|false|Ignore extra trailing fields that cannot be mapped|
> |csv.allow-trailing-comma|Boolean|false|Allow trailing comma after the last field value|
> |csv.fail-on-missing-columns|Boolean|false|Fail when a row has fewer columns than expected|
> |csv.empty-string-as-null|Boolean|false|Treat empty string values as null|
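> As a rough illustration of how the proposed options might be used once
> available, the sketch below sets a few of them in a table definition. The
> table name, columns, and path are hypothetical; the `csv.*` keys are the
> ones proposed in this issue and do not exist in released Flink versions
> until the change is merged.
>
> ```sql
> -- Hypothetical table: name, columns, and path are illustrative only.
> CREATE TABLE orders_csv (
>   order_id BIGINT,
>   customer STRING,
>   amount   DECIMAL(10, 2)
> ) WITH (
>   'connector' = 'filesystem',
>   'path' = 'file:///tmp/orders',
>   'format' = 'csv',
>   -- Proposed options from this issue; each defaults to false when omitted.
>   'csv.trim-spaces' = 'true',
>   'csv.ignore-trailing-unmappable' = 'true',
>   'csv.empty-string-as-null' = 'true'
> );
> ```
>
> Leaving an option out keeps the current behavior, so existing table
> definitions would not need to change.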
> h2. Modified Files
> - CsvFormatOptions: Add 5 new ConfigOption definitions
> - CsvCommons: Register new options in optionalOptions() and forwardOptions()
> - CsvRowDataDeserializationSchema: Add CsvParser.Feature configuration
> support in Builder and open()
> - CsvFileFormatFactory: Add CsvParser.Feature configuration support for
> batch file reading
> - CsvFormatFactory: Wire new options to deserialization schema builder
> - CsvFormatFactoryTest: Add tests for each new option
> - docs: Update CSV format documentation
> h1. Compatibility
> - Fully backward compatible: all new options are optional, with safe default
> values that preserve existing behavior
> - No changes to serialization path
> - No breaking API changes (only additions to @Internal classes and
> @PublicEvolving config options)