Myracle opened a new pull request, #27578:
URL: https://github.com/apache/flink/pull/27578

   ## What is the purpose of the change
   
   This pull request exposes five additional Jackson `CsvParser.Feature` options 
as Flink SQL CSV format configuration options, so that users can fine-tune CSV 
deserialization behavior. The CSV format currently exposes only a limited set 
of parser-related options (such as `csv.allow-comments` and 
`csv.ignore-parse-errors`), which leaves several useful Jackson CSV parser 
features inaccessible. This change adds the following options:
   
   - `csv.trim-spaces` — Trims leading/trailing whitespace from unquoted field 
values (`CsvParser.Feature.TRIM_SPACES`)
   - `csv.ignore-trailing-unmappable` — Ignores extra trailing columns that 
don't map to the schema (`CsvParser.Feature.IGNORE_TRAILING_UNMAPPABLE`)
   - `csv.allow-trailing-comma` — Allows a trailing comma after the last field 
value (`CsvParser.Feature.ALLOW_TRAILING_COMMA`)
   - `csv.fail-on-missing-columns` — Fails when a row has fewer columns than 
expected by the schema (`CsvParser.Feature.FAIL_ON_MISSING_COLUMNS`)
   - `csv.empty-string-as-null` — Treats empty string values as null 
(`CsvParser.Feature.EMPTY_STRING_AS_NULL`)
   
   These options only affect deserialization (source side).
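
As a rough illustration of the option-to-feature mapping listed above, the sketch below resolves the five option keys into explicit on/off settings. It is not the code from this PR: the class name, the `resolve` helper, and the local `Feature` enum (a stand-in for Jackson's `CsvParser.Feature`, so the example compiles without Jackson on the classpath) are all hypothetical. Options that are not set are left out of the result, on the assumption that the mapper's own defaults then apply.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CsvParserOptionSketch {

    // Hypothetical local stand-in for Jackson's CsvParser.Feature;
    // only the constant names are taken from the PR description.
    enum Feature {
        TRIM_SPACES,
        IGNORE_TRAILING_UNMAPPABLE,
        ALLOW_TRAILING_COMMA,
        FAIL_ON_MISSING_COLUMNS,
        EMPTY_STRING_AS_NULL
    }

    // Option keys as listed in the PR description, mapped to their features.
    static final Map<String, Feature> OPTION_TO_FEATURE = new LinkedHashMap<>();

    static {
        OPTION_TO_FEATURE.put("csv.trim-spaces", Feature.TRIM_SPACES);
        OPTION_TO_FEATURE.put("csv.ignore-trailing-unmappable", Feature.IGNORE_TRAILING_UNMAPPABLE);
        OPTION_TO_FEATURE.put("csv.allow-trailing-comma", Feature.ALLOW_TRAILING_COMMA);
        OPTION_TO_FEATURE.put("csv.fail-on-missing-columns", Feature.FAIL_ON_MISSING_COLUMNS);
        OPTION_TO_FEATURE.put("csv.empty-string-as-null", Feature.EMPTY_STRING_AS_NULL);
    }

    // Returns only the features the user set explicitly (true or false).
    // Unset options are omitted rather than given a guessed default.
    static Map<Feature, Boolean> resolve(Map<String, String> tableOptions) {
        Map<Feature, Boolean> explicit = new LinkedHashMap<>();
        for (Map.Entry<String, Feature> entry : OPTION_TO_FEATURE.entrySet()) {
            String raw = tableOptions.get(entry.getKey());
            if (raw != null) {
                explicit.put(entry.getValue(), Boolean.parseBoolean(raw));
            }
        }
        return explicit;
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("csv.trim-spaces", "true");
        options.put("csv.fail-on-missing-columns", "false");
        // Prints the two explicitly configured features and their values.
        System.out.println(resolve(options));
    }
}
```

In a real deserializer, each resolved entry would presumably translate into an enable or disable call on the `CsvMapper`, done once at setup time rather than per record.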
   
   ## Brief change log
   
   - Added 5 new `ConfigOption<Boolean>` definitions in `CsvFormatOptions` with 
descriptions indicating they only affect deserialization
   - Registered new options in `CsvCommons` as both optional and forwarded 
options
   - Extended `CsvRowDataDeserializationSchema.Builder` with setter methods for 
each new feature, and configured enabled/disabled features on the `CsvMapper` 
during `open()`
   - Updated `CsvFormatFactory.configureDeserializationSchema()` to read and 
pass the new options to the schema builder
   - Updated `CsvFileFormatFactory` to support the new features in the Bulk 
Format / File Source path via `createCsvMapperFactory()`
   - Fixed a pre-existing bug in `CsvFileFormatFactory` where 
`ignoreParseErrors` was determined by `isPresent()` instead of reading the 
actual config value
   - Updated both English and Chinese documentation with descriptions for the 5 
new options
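
The `isPresent()` pitfall mentioned in the change log can be illustrated with plain `java.util.Optional`. This is only a sketch under stated assumptions: the class, the helper methods, and the string-map representation of the options are hypothetical, and the fallback of `false` for an unset option is assumed here. The actual `CsvFileFormatFactory` code is not shown in this description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class IgnoreParseErrorsSketch {

    static final String KEY = "csv.ignore-parse-errors";

    // Hypothetical helper: reads an optional boolean from a string-based option map.
    static Optional<Boolean> getOptional(Map<String, String> options, String key) {
        return Optional.ofNullable(options.get(key)).map(Boolean::parseBoolean);
    }

    // Buggy variant: only checks that the option was set, so an explicit
    // "false" is still treated as "ignore parse errors".
    static boolean buggy(Map<String, String> options) {
        return getOptional(options, KEY).isPresent();
    }

    // Fixed variant: reads the configured value and falls back to false
    // (an assumed default) when the option is absent.
    static boolean fixed(Map<String, String> options) {
        return getOptional(options, KEY).orElse(false);
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put(KEY, "false");
        System.out.println("buggy: " + buggy(options));
        System.out.println("fixed: " + fixed(options));
    }
}
```

The two variants only disagree when the option is set explicitly to `false`, which is why a bug of this shape can go unnoticed as long as users either omit the option or set it to `true`.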
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
   - Added `testTrimSpaces()` test in `CsvFormatFactoryTest` to verify the 
`csv.trim-spaces` option trims whitespace from unquoted field values
   - Added `testIgnoreTrailingUnmappable()` test to verify extra trailing 
columns are silently ignored
   - Added `testAllowTrailingComma()` test to verify a trailing comma after the 
last field value is accepted
   - Added `testFailOnMissingColumns()` test to verify deserialization fails 
when a row has fewer columns than expected
   - Added `testEmptyStringAsNull()` test to verify empty strings are treated 
as null values
   - Added `testAllCsvParserFeaturesTogether()` test to verify all 5 new 
features work correctly when enabled simultaneously
   - Updated `testSeDeSchema()` test to include `csv.trim-spaces` and 
`csv.empty-string-as-null` options, verifying the complete option-to-schema 
configuration chain
   - Added `testBulkFormatWithParserFeatures()` test to verify the Bulk Format 
/ File Source path correctly applies the new `CsvParser.Feature` options via 
`CsvBulkDecodingFormat`
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: yes (`CsvFormatOptions` is annotated with 
`@PublicEvolving`, 5 new `ConfigOption` fields are added)
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no (the 
features are configured once during `open()`, not per-record)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? docs (both English and Chinese 
documentation updated in `docs/content/docs/connectors/table/formats/csv.md` 
and `docs/content.zh/docs/connectors/table/formats/csv.md`)
