Vishal-Kamlapure opened a new pull request, #28594: URL: https://github.com/apache/flink/pull/28594
## What is the purpose of the change

This pull request adds support for `csv.ignore-first-line` for the **filesystem** CSV Table connector ([FLINK-20746](https://issues.apache.org/jira/browse/FLINK-20746)).

The filesystem CSV connector currently treats the first CSV record as data. When reading CSV files with a header row, users must preprocess the files or rely on `csv.ignore-parse-errors`, which still attempts to deserialize the header.

This change introduces a new format option, `csv.ignore-first-line`, that skips the first CSV record of each input file during deserialization. The table schema is **not** derived from the skipped record; this option only skips the first record and does not enable header-based schema inference.

The option defaults to `false`, preserving the existing behavior. It is supported only for the filesystem CSV connector and does not affect message-based CSV formats.

Example:

```sql
CREATE TABLE my_csv (
  id STRING,
  name STRING,
  amount STRING
) WITH (
  'connector' = 'filesystem',
  'path' = '/path/to/csv',
  'format' = 'csv',
  'csv.ignore-first-line' = 'true'
);
```

## Brief change log

- Added a new CSV format option `csv.ignore-first-line` (default: `false`).
- Registered the option only for the filesystem CSV connector.
- Applied the option during filesystem CSV deserialization.
- Added unit tests covering the default behavior and the new option.
- Updated both English and Chinese CSV connector documentation.

## Verifying this change

This change added tests and can be verified as follows:

- Added `CsvFormatFactoryTest#testIgnoreFirstLineSkipsFirstRecord`, verifying that enabling `csv.ignore-first-line` skips the first CSV record.
- Added `CsvFormatFactoryTest#testIgnoreFirstLineDisabledByDefault`, verifying that the default behavior remains unchanged.
- Verified the module locally using:

  ```bash
  ./mvnw -pl flink-formats/flink-csv -Dtest=CsvFormatFactoryTest#testIgnoreFirstLineSkipsFirstRecord,CsvFormatFactoryTest#testIgnoreFirstLineDisabledByDefault test
  ./mvnw -pl flink-formats/flink-csv verify
  ```

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): **no**
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **yes** (adds a new `ConfigOption` to `CsvFormatOptions`)
- The serializers: **no**
- The runtime per-record code paths (performance sensitive): **yes** (only when `csv.ignore-first-line=true`; default behavior is unchanged)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: **no**
- The S3 file system connector: **no**

## Documentation

- Does this pull request introduce a new feature? **yes**
- If yes, how is the feature documented? **docs** (English and Chinese CSV connector documentation updated)

---

##### Was generative AI tooling used to co-author this PR?

- [ ] Yes (please specify the tool below)
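
For reviewers who want a quick picture of the semantics described under "What is the purpose of the change", the following is a minimal standalone sketch of "skip the first record of each file, do not use it for the schema". It is not the code in this PR and does not use any Flink API: the class `IgnoreFirstLineSketch` and the method `readFile` are hypothetical names, and the comma split is a deliberately naive stand-in for real CSV parsing (no quoting or escaping).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the Flink implementation. */
final class IgnoreFirstLineSketch {

    /**
     * Turns the lines of ONE input file into records.
     * When ignoreFirstLine is true, the first line is dropped and never parsed;
     * it is not used to derive column names or types.
     */
    static List<String[]> readFile(List<String> lines, boolean ignoreFirstLine) {
        List<String[]> records = new ArrayList<>();
        boolean first = true;
        for (String line : lines) {
            if (first) {
                first = false;
                if (ignoreFirstLine) {
                    continue; // skip the header row of this file
                }
            }
            // Naive split (assumption): real CSV parsing must handle quotes and escapes.
            records.add(line.split(",", -1));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> file = List.of("id,name,amount", "1,alice,10", "2,bob,20");
        // Default behavior: the header line is treated as data.
        System.out.println(readFile(file, false).size());
        // With the option enabled: the header line is skipped.
        System.out.println(readFile(file, true).size());
    }
}
```

Because the skip is applied per file in this sketch, calling `readFile` once for each input file drops one header per file, which matches the "first CSV record of each input file" wording above.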
