Liu created FLINK-39083:
---------------------------
Summary: Support field-level error tolerance for CSV format deserialization
Key: FLINK-39083
URL: https://issues.apache.org/jira/browse/FLINK-39083
Project: Flink
Issue Type: Improvement
Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Reporter: Liu
Attachments: image-2026-02-13-10-08-41-570.png
h1. Motivation
Currently, the csv.ignore-parse-errors option in CSV format provides only two
behaviors:
* false (default): Any field-level parse error causes the entire job to fail.
* true: Any field-level parse error causes the entire row to be discarded
(the deserializer returns null for that row).
This "all-or-nothing" approach is problematic in production ETL scenarios. For
example, consider a CSV table with 50 columns where only 1 TIMESTAMP column
occasionally has malformed values. With ignore-parse-errors=true, the entire
row—including the 49 correctly parsed fields—is silently dropped. This leads to
significant and unnecessary data loss.
h1. Proposal
Introduce a new configuration option csv.ignore-single-field-parse-error
(boolean, default false) that provides field-level error tolerance:
* When enabled, if a single field fails type conversion (e.g., "abc" for an
INT column), only that field is set to null, and the rest of the row is
preserved.
* Jackson-level parsing errors (e.g., malformed CSV structure) are not
affected by this option and continue to be governed by ignore-parse-errors.
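To make the intended per-field behavior concrete, here is a minimal, self-contained Java sketch. It does not use any Flink classes: FieldTolerantRowSketch, FieldConverter, and convertRow are hypothetical names chosen for illustration, and the actual change in CsvToRowDataConverters may look different.

```java
import java.util.Arrays;

class FieldTolerantRowSketch {

    /** Hypothetical stand-in for a per-field converter; not a Flink interface. */
    interface FieldConverter {
        Object convert(String raw);
    }

    /**
     * Converts one row of raw CSV fields.
     * Flag on: a field that fails conversion becomes null, the other fields are kept.
     * Flag off: the failure is re-thrown, as happens today.
     */
    static Object[] convertRow(
            String[] raw, FieldConverter[] converters, boolean ignoreSingleFieldParseErrors) {
        Object[] row = new Object[converters.length];
        for (int i = 0; i < converters.length; i++) {
            try {
                row[i] = converters[i].convert(raw[i]);
            } catch (RuntimeException e) {
                if (!ignoreSingleFieldParseErrors) {
                    throw new RuntimeException("Failed to parse field at index " + i, e);
                }
                row[i] = null; // tolerate the bad field, keep the rest of the row
            }
        }
        return row;
    }

    public static void main(String[] args) {
        FieldConverter[] converters = {Integer::valueOf, Integer::valueOf, s -> s};
        String[] raw = {"42", "abc", "hello"};

        // Flag on: only the malformed INT field is nulled.
        System.out.println(Arrays.toString(convertRow(raw, converters, true)));
        // prints "[42, null, hello]"

        // Flag off: the field-level error propagates to the caller.
        try {
            convertRow(raw, converters, false);
        } catch (RuntimeException e) {
            System.out.println("row failed: " + e.getMessage());
        }
    }
}
```

In this sketch the flag is only consulted inside the per-field catch block, so structural (Jackson-level) errors raised before field conversion would be unaffected by it.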
Behavior matrix:
!image-2026-02-13-10-08-41-570.png|width=532,height=149!
*Scope of Changes*
# CsvFormatOptions — Add new ConfigOption<Boolean> for
ignore-single-field-parse-error.
# CsvCommons — Register the new option in optionalOptions() and
forwardOptions().
# CsvToRowDataConverters — Core change: add ignoreSingleFieldParseErrors flag;
modify the catch block in createRowConverter() to set the failed field to null
instead of re-throwing.
# CsvRowDataDeserializationSchema — Add builder setter; pass flag through to
CsvToRowDataConverters.
# CsvFormatFactory — Wire the new config option to the deserialization schema
builder.
# CsvFileFormatFactory — Wire the new config option in the bulk decoding path.
# Tests — Add unit tests covering all four combinations in the behavior matrix.
# Documentation — Update English and Chinese CSV format docs.
h1. Compatibility
* Fully backward compatible: The new option defaults to false, preserving
existing behavior.
* No changes to serialization path: This option only affects deserialization.
* No breaking public API changes: Only a new optional configuration option and
its builder setter are added.
h1. Discussion Points
I'd like to get community feedback on the following before proceeding with
implementation:
# Option naming: Is csv.ignore-single-field-parse-error clear enough?
Alternatives considered: csv.field-error-as-null, csv.partial-parse-errors.
# Interaction with ignore-parse-errors: Should
ignore-single-field-parse-error=true implicitly suppress field-level errors
even when ignore-parse-errors=false? Or should it only take effect when
ignore-parse-errors is also true?
# Cross-format consistency: Should we consider a similar option for the JSON
format (JsonToRowDataConverters) in a follow-up JIRA? The JSON format has a
similar "all-or-nothing" behavior today.
# Logging: The proposed implementation logs a WARN for each field-level error.
Should this be configurable or use a different log level (e.g., DEBUG) to avoid
log flooding?
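For the logging question, one possible middle ground (an assumption for discussion, not part of the proposed implementation) is to emit WARN only for the first N field-level errors and count the rest. The sketch below uses java.util.logging to stay self-contained, whereas Flink itself logs through SLF4J; ThrottledFieldErrorLogSketch and its methods are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

class ThrottledFieldErrorLogSketch {

    private static final Logger LOG =
            Logger.getLogger(ThrottledFieldErrorLogSketch.class.getName());

    private final long maxWarnings;
    private final AtomicLong errorCount = new AtomicLong();

    ThrottledFieldErrorLogSketch(long maxWarnings) {
        this.maxWarnings = maxWarnings;
    }

    /** Records one field-level error; returns true if a WARN was actually emitted. */
    boolean onFieldError(String fieldName, Exception cause) {
        long n = errorCount.incrementAndGet();
        if (n <= maxWarnings) {
            LOG.warning("Failed to parse field '" + fieldName + "', setting it to null: " + cause);
            return true;
        }
        return false; // still counted, but not logged
    }

    /** Total number of field-level errors seen, logged or not. */
    long totalErrors() {
        return errorCount.get();
    }
}
```

The counter could alternatively be exposed as a metric, which would keep the error rate visible without any per-record log output.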
h1. Related
Depends on / follows: https://issues.apache.org/jira/browse/FLINK-39065