[ 
https://issues.apache.org/jira/browse/FLINK-20746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086446#comment-18086446
 ] 

Vishal Kamlapure edited comment on FLINK-20746 at 6/5/26 6:17 PM:
------------------------------------------------------------------

Hi [~leonard],

I investigated this issue and reproduced the behavior locally.

While tracing the current implementation, I found that the filesystem CSV path 
goes through {{CsvFileFormatFactory.buildCsvSchema()}}, which builds a 
Jackson {{CsvSchema}}. The underlying Jackson builder already supports 
{{setSkipFirstDataRow(true)}}.

From the Stack Overflow discussion, my understanding is that the main 
challenge was supporting this generically across all CSV connectors, 
especially for message-based sources where identifying the "first record" is 
not straightforward.

However, for filesystem CSV sources, the reader operates on files, so the first 
row of each file is well-defined. Because of that, I was wondering if it would 
make sense to address this issue specifically for the filesystem CSV path by:
 * adding a {{csv.ignore-first-line}} option,
 * wiring it through {{CsvFormatOptions}},
 * and calling {{csvBuilder.setSkipFirstDataRow(true)}} in {{CsvFileFormatFactory}}.

This seems to preserve the legacy {{ignoreFirstLine}} behavior without 
introducing header-based schema mapping semantics.
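
As a rough illustration of the intended semantics, the sketch below uses only the Java standard library. It is not Flink or Jackson code, and the class and method names ({{SkipFirstRowSketch}}, {{readFile}}) are hypothetical. It assumes a naive comma split without quoting support. The point it shows is that the skip is applied once per file and that the dropped line is never inspected, so columns are still mapped by position rather than by header name.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipFirstRowSketch {

    /**
     * Parses the lines of one CSV file into rows, optionally dropping the
     * first line. Columns are split by position only; the dropped line is
     * never inspected, so it cannot influence column mapping.
     * (Hypothetical helper for illustration, not part of Flink.)
     */
    public static List<String[]> readFile(List<String> lines, boolean ignoreFirstLine) {
        List<String[]> rows = new ArrayList<>();
        int start = (ignoreFirstLine && !lines.isEmpty()) ? 1 : 0;
        for (int i = start; i < lines.size(); i++) {
            // Naive split: no quoting or escaping support, for illustration only.
            rows.add(lines.get(i).split(",", -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> fileA = List.of("id,name", "1,alice", "2,bob");
        List<String> fileB = List.of("id,name", "3,carol");

        // The skip is applied per file, so each file's own first line is dropped.
        List<String[]> all = new ArrayList<>();
        all.addAll(readFile(fileA, true));
        all.addAll(readFile(fileB, true));

        for (String[] row : all) {
            System.out.println(String.join("|", row));
        }
    }
}
```

With the flag disabled, the first line of each file would be returned as an ordinary data row, which matches the current behavior described above.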

If this approach sounds reasonable, I'd be happy to work on this issue and 
submit a PR.


was (Author: JIRAUSER311517):
Hi [~leonard],

I investigated this issue and reproduced the behavior locally.

The legacy {{CsvTableSource}} supported skipping the first line via 
{{ignoreFirstLine}}, but the current filesystem CSV connector does not 
expose an equivalent option.

Looking through the implementation, the filesystem CSV path already builds a 
Jackson {{CsvSchema}} in {{CsvFileFormatFactory.buildCsvSchema()}}. The 
underlying Jackson schema builder supports {{setSkipFirstDataRow(true)}}.

A possible implementation would be:
 * Add a new option {{csv.ignore-first-line}} (default {{false}})
 * Register it in {{CsvFormatOptions}} / {{CsvCommons}}
 * Apply it in {{CsvFileFormatFactory.buildCsvSchema()}}
 * Call {{csvBuilder.setSkipFirstDataRow(true)}} when enabled

This would preserve the legacy {{ignoreFirstLine}} semantics without 
introducing header-based column mapping ({{setUseHeader(true)}}).

If this approach sounds reasonable, I'd be happy to work on this issue and 
submit a PR.

> Support ignore-first-line option for CSV format
> -----------------------------------------------
>
>                 Key: FLINK-20746
>                 URL: https://issues.apache.org/jira/browse/FLINK-20746
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem, Formats (JSON, Avro, Parquet, 
> ORC, SequenceFile), Table SQL / Ecosystem
>    Affects Versions: 1.13.0
>            Reporter: Leonard Xu
>            Priority: Not a Priority
>              Labels: auto-deprioritized-major, auto-deprioritized-minor
>
> An ignore-first-line option would be a useful feature for the CSV format in 
> the filesystem connector, and I found users asking about it on 
> Stack Overflow [1]. 
>  
> [1] https://stackoverflow.com/questions/65359382/apache-flink-sql-reference-guide-for-table-properties



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
