[
https://issues.apache.org/jira/browse/FLINK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Liu updated FLINK-39065:
------------------------
Description:
h1. Motivation
Currently, the Flink CSV format only exposes a limited set of Jackson CsvParser
configurations (field-delimiter, quote-character, allow-comments, etc.).
However, the underlying Jackson CsvParser.Feature enum provides several useful
parsing features that are commonly needed in production environments to handle
various types of "dirty" CSV data:
1. *TRIM_SPACES* - Trims leading/trailing whitespace from field values. This is
one of the most common requirements when dealing with CSV files generated by
various tools that may pad fields with spaces.
2. *IGNORE_TRAILING_UNMAPPABLE* - Ignores extra fields at the end of a row that
exceed the schema definition. This is useful when CSV files have additional
columns appended over time.
3. *ALLOW_TRAILING_COMMA* - Allows a trailing comma at the end of a row without
treating it as an additional empty field. Some CSV export tools append a
trailing comma to every row.
4. *FAIL_ON_MISSING_COLUMNS* - Throws an error when a row has fewer columns
than the schema expects. This is complementary to `ignore-parse-errors` and
provides more precise error detection for data quality monitoring.
5. *EMPTY_STRING_AS_NULL* - Treats empty string values ("") as null. This is
very common in ETL scenarios where empty fields should be treated as null
values.
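To make the intended per-field semantics concrete, the first and last features can be sketched in plain Java. This is only an illustration of the expected behavior; the real logic lives inside Jackson's CsvParser, and class/method names below are invented for the sketch:

```java
// Illustrative sketch of the per-field semantics of TRIM_SPACES and
// EMPTY_STRING_AS_NULL. The actual behavior is implemented by Jackson's
// CsvParser; this class exists only to demonstrate the intent.
public class CsvFieldSemantics {

    /** TRIM_SPACES-like behavior: trim the field only when the feature is on. */
    static String applyTrimSpaces(String field, boolean trimSpaces) {
        return trimSpaces ? field.trim() : field;
    }

    /** EMPTY_STRING_AS_NULL-like behavior: coerce "" to null when the feature is on. */
    static String applyEmptyStringAsNull(String field, boolean emptyAsNull) {
        return (emptyAsNull && field != null && field.isEmpty()) ? null : field;
    }

    public static void main(String[] args) {
        System.out.println("[" + applyTrimSpaces("  foo ", true) + "]");    // [foo]
        System.out.println(applyEmptyStringAsNull("", true));               // null
        System.out.println("[" + applyEmptyStringAsNull("", false) + "]");  // []
    }
}
```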
All these features have been natively supported by Jackson's CsvParser.Feature
since version 2.9, and the Jackson shaded into Flink (2.18.2) already includes
all of them. This improvement simply exposes them as Flink SQL table format
options.
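Once exposed, the options would be set like any other CSV format option in a table DDL. A sketch (the table name, schema, connector, and path are illustrative; the option keys are the ones proposed below):

```sql
CREATE TABLE dirty_csv_source (
  id BIGINT,
  name STRING,
  remark STRING
) WITH (
  'connector' = 'filesystem',     -- illustrative connector
  'path' = 'file:///tmp/input',   -- illustrative path
  'format' = 'csv',
  'csv.trim-spaces' = 'true',
  'csv.empty-string-as-null' = 'true',
  'csv.ignore-trailing-unmappable' = 'true'
);
```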
h1. Changes
h2. New Configuration Options (all deserialization-only)
||Option Key||Type||Default||Description||
|csv.trim-spaces|Boolean|false|Trim leading/trailing spaces from unquoted field values|
|csv.ignore-trailing-unmappable|Boolean|false|Ignore extra trailing fields that cannot be mapped|
|csv.allow-trailing-comma|Boolean|false|Allow trailing comma after the last field value|
|csv.fail-on-missing-columns|Boolean|false|Fail when a row has fewer columns than expected|
|csv.empty-string-as-null|Boolean|false|Treat empty string values as null|
h2. Modified Files
- CsvFormatOptions: Add 5 new ConfigOption definitions
- CsvCommons: Register new options in optionalOptions() and forwardOptions()
- CsvRowDataDeserializationSchema: Add CsvParser.Feature configuration support
in Builder and open()
- CsvFileFormatFactory: Add CsvParser.Feature configuration support for batch
file reading
- CsvFormatFactory: Wire new options to deserialization schema builder
- CsvFormatFactoryTest: Add tests for each new option
- docs: Update CSV format documentation
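The CsvFormatOptions additions would follow the existing builder pattern in that class. A sketch of one of the five options (the other four follow the same shape; the description text is a placeholder to be settled in the PR, and note that the `csv.` prefix is added by format option resolution, not by the key itself):

```java
// Sketch only: one of the five proposed options in CsvFormatOptions.
public static final ConfigOption<Boolean> TRIM_SPACES =
        ConfigOptions.key("trim-spaces")
                .booleanType()
                .defaultValue(false)
                .withDescription(
                        "Optional flag to trim leading/trailing spaces from "
                                + "unquoted field values; disabled by default.");
```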
h1. Compatibility
- Fully backward compatible: all new options are optional, with safe defaults
that preserve existing behavior
- No changes to serialization path
- No API breaking changes (only additions to @Internal classes and
@PublicEvolving config options)
> Support additional CsvParser.Feature options for CSV format deserialization
> ---------------------------------------------------------------------------
>
> Key: FLINK-39065
> URL: https://issues.apache.org/jira/browse/FLINK-39065
> Project: Flink
> Issue Type: Improvement
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Reporter: Liu
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)