Liu created FLINK-39065:
---------------------------

             Summary: Support additional CsvParser.Feature options for CSV format deserialization
                 Key: FLINK-39065
                 URL: https://issues.apache.org/jira/browse/FLINK-39065
             Project: Flink
          Issue Type: Improvement
          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
            Reporter: Liu


h1. Motivation

Currently, the Flink CSV format only exposes a limited set of Jackson CsvParser 
configurations (field-delimiter, quote-character, allow-comments, etc.). 
However, the underlying Jackson CsvParser.Feature enum provides several useful 
parsing features that are commonly needed in production environments to handle 
various types of "dirty" CSV data:

1. *TRIM_SPACES* - Trims leading/trailing whitespace from unquoted field 
values. This is one of the most common requirements when dealing with CSV 
files generated by tools that pad fields with spaces.

2. *IGNORE_TRAILING_UNMAPPABLE* - Ignores extra fields at the end of a row that 
exceed the schema definition. This is useful when CSV files have additional 
columns appended over time.

3. *ALLOW_TRAILING_COMMA* - Allows a trailing comma at the end of a row without 
treating it as an additional empty field. Some CSV export tools append a 
trailing comma to every row.

4. *FAIL_ON_MISSING_COLUMNS* - Throws an error when a row has fewer columns 
than the schema expects. This is complementary to `ignore-parse-errors` and 
provides more precise error detection for data quality monitoring.

5. *EMPTY_STRING_AS_NULL* - Treats empty string values ("") as null. This is 
very common in ETL scenarios where empty fields should be treated as null 
values.

All of these features are natively supported by Jackson's CsvParser.Feature 
(version 2.9 and later), and the Flink shaded Jackson (2.18.2) already includes 
all of them. This improvement simply exposes them as Flink SQL table format 
options.

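The five behaviors described above can be illustrated with a small plain-Java sketch. It is not Jackson's implementation and does not use the Jackson API: the class `CsvFeatureSemanticsSketch`, its flag fields, and the `parse` method are hypothetical names introduced only for illustration, and the sketch handles unquoted, comma-separated lines only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration of the intended semantics only; NOT Jackson's implementation. */
final class CsvFeatureSemanticsSketch {

    // Hypothetical flags named after the CsvParser.Feature constants in this issue.
    boolean trimSpaces;
    boolean ignoreTrailingUnmappable;
    boolean allowTrailingComma;
    boolean failOnMissingColumns;
    boolean emptyStringAsNull;

    /** Splits one unquoted line on ',' and maps it onto a schema with `columns` fields. */
    String[] parse(String line, int columns) {
        List<String> fields = new ArrayList<>(Arrays.asList(line.split(",", -1)));

        // ALLOW_TRAILING_COMMA: one extra, empty field after the last schema column is dropped.
        if (allowTrailingComma
                && fields.size() == columns + 1
                && fields.get(fields.size() - 1).isEmpty()) {
            fields.remove(fields.size() - 1);
        }

        // IGNORE_TRAILING_UNMAPPABLE: extra fields are discarded instead of rejected.
        if (fields.size() > columns) {
            if (!ignoreTrailingUnmappable) {
                throw new IllegalArgumentException("too many columns: " + fields.size());
            }
            fields = fields.subList(0, columns);
        }

        // FAIL_ON_MISSING_COLUMNS: short rows are rejected instead of padded with null.
        if (fields.size() < columns && failOnMissingColumns) {
            throw new IllegalArgumentException("missing columns: " + fields.size());
        }

        String[] row = new String[columns]; // columns without a value stay null
        for (int i = 0; i < fields.size(); i++) {
            String value = fields.get(i);
            if (trimSpaces) {
                value = value.trim(); // TRIM_SPACES
            }
            // EMPTY_STRING_AS_NULL; applying it after trimming is this sketch's choice.
            if (emptyStringAsNull && value.isEmpty()) {
                value = null;
            }
            row[i] = value;
        }
        return row;
    }

    public static void main(String[] args) {
        CsvFeatureSemanticsSketch parser = new CsvFeatureSemanticsSketch();
        parser.trimSpaces = true;
        parser.allowTrailingComma = true;
        System.out.println(Arrays.toString(parser.parse(" a , b ,", 2))); // prints [a, b]
    }
}
```

With all flags left at false, the sketch rejects rows with extra fields and pads short rows with null, which is how it models the "existing behavior" the defaults are meant to preserve.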
h1. Changes
h2. New Configuration Options (all deserialization-only)

| Option Key | Type | Default | Description |
|------------|------|---------|-------------|
| csv.trim-spaces | Boolean | false | Trim leading/trailing spaces from unquoted field values |
| csv.ignore-trailing-unmappable | Boolean | false | Ignore extra trailing fields that cannot be mapped |
| csv.allow-trailing-comma | Boolean | false | Allow trailing comma after the last field value |
| csv.fail-on-missing-columns | Boolean | false | Fail when a row has fewer columns than expected |
| csv.empty-string-as-null | Boolean | false | Treat empty string values as null |

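As a usage sketch, the proposed options would be set in a table's WITH clause like any other CSV format option. The Java snippet below only builds and prints such a DDL string. The table name `orders`, its columns, and the path `/tmp/orders` are hypothetical, and the `csv.*` keys shown are the ones proposed in this issue, so they are not assumed to exist in any released Flink version.

```java
final class CsvOptionsDdlSketch {

    /** Builds a CREATE TABLE statement that enables two of the proposed options. */
    static String ddl() {
        return String.join("\n",
                "CREATE TABLE orders (",            // hypothetical table name and schema
                "  id BIGINT,",
                "  note STRING",
                ") WITH (",
                "  'connector' = 'filesystem',",
                "  'path' = '/tmp/orders',",        // hypothetical path
                "  'format' = 'csv',",
                "  'csv.trim-spaces' = 'true',",    // proposed in this issue
                "  'csv.empty-string-as-null' = 'true'", // proposed in this issue
                ")");
    }

    public static void main(String[] args) {
        // In a real job this string would be passed to TableEnvironment#executeSql.
        System.out.println(ddl());
    }
}
```

Options that are not listed keep their default of false, so existing table definitions would behave as before.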
h2. Modified Files

- CsvFormatOptions: Add 5 new ConfigOption definitions
- CsvCommons: Register new options in optionalOptions() and forwardOptions()
- CsvRowDataDeserializationSchema: Add CsvParser.Feature configuration support in Builder and open()
- CsvFileFormatFactory: Add CsvParser.Feature configuration support for batch file reading
- CsvFormatFactory: Wire new options to deserialization schema builder
- CsvFormatFactoryTest: Add tests for each new option
- docs: Update CSV format documentation
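The wiring step listed above (option key to parser feature) can be sketched in plain Java as follows. This is an assumption-laden illustration, not the patch itself: the real change would presumably use Flink's ConfigOption definitions and Jackson's CsvParser.Feature enum, whereas `CsvOptionWiringSketch` and `enabledFeatures` are hypothetical names that use plain strings and maps so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CsvOptionWiringSketch {

    // Option key -> CsvParser.Feature constant name, as listed in this issue.
    static final Map<String, String> OPTION_TO_FEATURE = new LinkedHashMap<>();

    static {
        OPTION_TO_FEATURE.put("csv.trim-spaces", "TRIM_SPACES");
        OPTION_TO_FEATURE.put("csv.ignore-trailing-unmappable", "IGNORE_TRAILING_UNMAPPABLE");
        OPTION_TO_FEATURE.put("csv.allow-trailing-comma", "ALLOW_TRAILING_COMMA");
        OPTION_TO_FEATURE.put("csv.fail-on-missing-columns", "FAIL_ON_MISSING_COLUMNS");
        OPTION_TO_FEATURE.put("csv.empty-string-as-null", "EMPTY_STRING_AS_NULL");
    }

    /** Returns the feature names to enable; options that are absent default to false. */
    static List<String> enabledFeatures(Map<String, String> tableOptions) {
        List<String> enabled = new ArrayList<>();
        for (Map.Entry<String, String> entry : OPTION_TO_FEATURE.entrySet()) {
            String raw = tableOptions.getOrDefault(entry.getKey(), "false");
            if (Boolean.parseBoolean(raw)) {
                enabled.add(entry.getValue());
            }
        }
        return enabled;
    }
}
```

Because every option defaults to false, an empty option map yields an empty feature list, which matches the backward-compatibility claim in this issue.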
h1. Compatibility

- Fully backward compatible: all new options are optional, with default values that preserve existing behavior
- No changes to the serialization path
- No breaking API changes (only additions to @Internal classes and @PublicEvolving config options)



