Liu created FLINK-39065:
---------------------------
Summary: Support additional CsvParser.Feature options for CSV
format deserialization
Key: FLINK-39065
URL: https://issues.apache.org/jira/browse/FLINK-39065
Project: Flink
Issue Type: Improvement
Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Reporter: Liu
h1. Motivation
Currently, the Flink CSV format only exposes a limited set of Jackson CsvParser
configurations (field-delimiter, quote-character, allow-comments, etc.).
However, the underlying Jackson CsvParser.Feature enum provides several useful
parsing features that are commonly needed in production environments to handle
various types of "dirty" CSV data:
1. *TRIM_SPACES* - Trims leading/trailing whitespace from unquoted field
values. This is one of the most common requirements when dealing with CSV
files generated by tools that pad fields with spaces.
2. *IGNORE_TRAILING_UNMAPPABLE* - Ignores extra fields at the end of a row that
exceed the schema definition. This is useful when CSV files have additional
columns appended over time.
3. *ALLOW_TRAILING_COMMA* - Allows a trailing comma at the end of a row without
treating it as an additional empty field. Some CSV export tools append a
trailing comma to every row.
4. *FAIL_ON_MISSING_COLUMNS* - Throws an error when a row has fewer columns
than the schema expects. This is complementary to `ignore-parse-errors` and
provides more precise error detection for data quality monitoring.
5. *EMPTY_STRING_AS_NULL* - Treats empty string values ("") as null. This is
very common in ETL scenarios where empty fields should be treated as null
values.
All of these features are natively supported by Jackson's CsvParser.Feature
(version 2.9+), and the Flink shaded Jackson (2.18.2) already includes all of
them. This improvement simply exposes them as Flink SQL table format options.
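The intended effect of the five features listed above can be illustrated with a small toy model. This is only a sketch of the behaviors as described in this issue, not Jackson's implementation: the class `CsvFeatureSketch`, its `Feature` enum, and `parseRow` are hypothetical names, quoting is ignored entirely, and padding missing columns with null is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Toy model of the five behaviors; NOT Jackson's implementation, quoting is ignored. */
final class CsvFeatureSketch {

    /** Hypothetical stand-in mirroring the CsvParser.Feature names from this issue. */
    enum Feature {
        TRIM_SPACES,
        IGNORE_TRAILING_UNMAPPABLE,
        ALLOW_TRAILING_COMMA,
        FAIL_ON_MISSING_COLUMNS,
        EMPTY_STRING_AS_NULL
    }

    /** Maps one unquoted CSV line onto a schema with the given number of columns. */
    static List<String> parseRow(String line, int columns, Set<Feature> on) {
        // Limit -1 keeps trailing empty strings, so "a,b," yields ["a", "b", ""].
        List<String> cells = new ArrayList<>(Arrays.asList(line.split(",", -1)));

        if (on.contains(Feature.TRIM_SPACES)) {
            for (int i = 0; i < cells.size(); i++) {
                cells.set(i, cells.get(i).trim());
            }
        }
        // Exactly one extra empty cell is the footprint of a trailing comma.
        if (on.contains(Feature.ALLOW_TRAILING_COMMA)
                && cells.size() == columns + 1
                && cells.get(columns).isEmpty()) {
            cells.remove(columns);
        }
        if (cells.size() > columns) {
            if (!on.contains(Feature.IGNORE_TRAILING_UNMAPPABLE)) {
                throw new IllegalArgumentException("too many columns: " + cells.size());
            }
            cells = new ArrayList<>(cells.subList(0, columns));
        }
        if (cells.size() < columns) {
            if (on.contains(Feature.FAIL_ON_MISSING_COLUMNS)) {
                throw new IllegalArgumentException("missing columns: " + cells.size());
            }
            // Assumption: missing columns are padded with null.
            while (cells.size() < columns) {
                cells.add(null);
            }
        }
        if (on.contains(Feature.EMPTY_STRING_AS_NULL)) {
            for (int i = 0; i < cells.size(); i++) {
                if (cells.get(i) != null && cells.get(i).isEmpty()) {
                    cells.set(i, null);
                }
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        Set<Feature> on = EnumSet.of(Feature.TRIM_SPACES, Feature.EMPTY_STRING_AS_NULL);
        System.out.println(parseRow(" 1 , ,x", 3, on)); // prints [1, null, x]
    }
}
```

In the actual change, these behaviors would come from enabling the corresponding CsvParser.Feature constants on the shaded Jackson CSV reader rather than from hand-written logic like this.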
h1. Changes
h2. New Configuration Options (all deserialization-only)
| Option Key | Type | Default | Description |
|------------|------|---------|-------------|
| csv.trim-spaces | Boolean | false | Trim leading/trailing spaces from unquoted field values |
| csv.ignore-trailing-unmappable | Boolean | false | Ignore extra trailing fields that cannot be mapped |
| csv.allow-trailing-comma | Boolean | false | Allow trailing comma after the last field value |
| csv.fail-on-missing-columns | Boolean | false | Fail when a row has fewer columns than expected |
| csv.empty-string-as-null | Boolean | false | Treat empty string values as null |
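As a rough illustration of how the option keys in the table could be resolved into a set of parser features, here is a minimal, self-contained sketch. The class `CsvOptionSketch`, its `Feature` enum, and `enabledFeatures` are hypothetical stand-ins; the real change would use Flink's ConfigOption definitions and Jackson's CsvParser.Feature instead of a plain string map.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class CsvOptionSketch {

    /** Hypothetical stand-in for the CsvParser.Feature constants named in this issue. */
    enum Feature {
        TRIM_SPACES("csv.trim-spaces"),
        IGNORE_TRAILING_UNMAPPABLE("csv.ignore-trailing-unmappable"),
        ALLOW_TRAILING_COMMA("csv.allow-trailing-comma"),
        FAIL_ON_MISSING_COLUMNS("csv.fail-on-missing-columns"),
        EMPTY_STRING_AS_NULL("csv.empty-string-as-null");

        final String optionKey;

        Feature(String optionKey) {
            this.optionKey = optionKey;
        }
    }

    /** Returns the features whose option is "true"; absent keys fall back to the default, false. */
    static Set<Feature> enabledFeatures(Map<String, String> tableOptions) {
        Set<Feature> enabled = EnumSet.noneOf(Feature.class);
        for (Feature f : Feature.values()) {
            String raw = tableOptions.getOrDefault(f.optionKey, "false");
            if (Boolean.parseBoolean(raw)) {
                enabled.add(f);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Mirrors what a table's WITH clause might carry for the CSV format.
        Map<String, String> with = new LinkedHashMap<>();
        with.put("format", "csv");
        with.put("csv.trim-spaces", "true");
        with.put("csv.empty-string-as-null", "true");
        System.out.println(enabledFeatures(with)); // prints [TRIM_SPACES, EMPTY_STRING_AS_NULL]
    }
}
```

Because every option defaults to false, a table that sets none of the new keys resolves to an empty feature set in this sketch.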
h2. Modified Files
- CsvFormatOptions: Add 5 new ConfigOption definitions
- CsvCommons: Register new options in optionalOptions() and forwardOptions()
- CsvRowDataDeserializationSchema: Add CsvParser.Feature configuration support
in Builder and open()
- CsvFileFormatFactory: Add CsvParser.Feature configuration support for batch
file reading
- CsvFormatFactory: Wire new options to deserialization schema builder
- CsvFormatFactoryTest: Add tests for each new option
- docs: Update CSV format documentation
h1. Compatibility
- Fully backward compatible: all new options are optional with safe default
values that preserve existing behavior
- No changes to serialization path
- No breaking API changes (only additions to @Internal classes and
@PublicEvolving config options)