[jira] [Comment Edited] (FLINK-14266) Introduce RowCsvInputFormat to new CSV module

Jingsong Lee (Jira) Fri, 11 Oct 2019 04:29:02 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-14266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949378#comment-16949378
 ]


Jingsong Lee edited comment on FLINK-14266 at 10/11/19 11:27 AM:
-----------------------------------------------------------------

Thanks [~fhueske], I think there are two choices:
 # Extend DelimitedInputFormat and use CsvRowDeserializationSchema to 
deserialize the bytes given an offset and numBytes; selectedFields would need 
to be handled too. DelimitedInputFormat already has the split logic to deal 
with partial lines. But as Fabian said, we do not know whether the next 
newline character is a record delimiter or part of a string field.
 # Use Jackson's ObjectReader.readValues(InputStream). The difficulties are:
 ## ObjectReader does not know the current read offset, because it buffers 
more bytes than it has parsed. But we need to stop at the right place when 
reading a FileSplit. One solution is to use BoundedInputStream, but we still 
need to read the unfinished last line, so we first need to adjust splitLength 
to the correct end position based on the line delimiter and escapeChar.
 ## We also need to correctly determine the line separator when we start 
reading a FileSplit whose start offset is in the middle of the file. If the 
first char is a line separator, the character before it may be an escape 
character. We need to deal with these cases carefully.
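
To make the end-position problem concrete, here is a minimal sketch of how the adjusted end of a split could be found. It is not taken from Flink or Jackson: the class CsvSplitBoundary and the method findRecordEnd are hypothetical names, and it assumes single-byte quote and escape characters, '\n' as the record delimiter, and that scanning starts at a known record start (which is exactly the part that is hard for a split beginning in the middle of a file).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Flink or Jackson.
final class CsvSplitBoundary {

    /**
     * Returns the index just past the first record-terminating '\n' whose
     * following record would start at or after splitEnd, or data.length if
     * no such delimiter exists. Newlines inside quoted fields are skipped.
     * Assumption: recordStart points at the first byte of a record.
     */
    static int findRecordEnd(byte[] data, int recordStart, int splitEnd,
                             byte quote, byte escape) {
        boolean inQuotes = false;
        for (int i = recordStart; i < data.length; i++) {
            byte b = data[i];
            if (b == escape && escape != quote && i + 1 < data.length) {
                i++; // skip the escaped byte so it cannot toggle the quote state
            } else if (b == quote) {
                // A doubled quote ("") toggles twice and leaves the state unchanged.
                inQuotes = !inQuotes;
            } else if (b == '\n' && !inQuotes && i + 1 >= splitEnd) {
                return i + 1;
            }
        }
        return data.length;
    }

    public static void main(String[] args) {
        // The first record contains a newline inside a quoted field.
        byte[] data = "a,\"x\ny\",b\nc,d\n".getBytes(StandardCharsets.UTF_8);
        // The nominal split end (4) falls inside the quoted field.
        System.out.println(findRecordEnd(data, 0, 4, (byte) '"', (byte) '\\'));
    }
}
```

With the sample input, the nominal split end of 4 falls inside the quoted field, and the call returns 10, the end of the first complete record; a scan that ignored quoting would stop at 5. The returned position is the kind of value that could serve as the limit of a bounded stream handed to the parser, under the assumptions stated above.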

 

 


was (Author: lzljs3620320):
Thanks [~fhueske], I think there are two choices:
 # Extend DelimitedInputFormat and use CsvRowDeserializationSchema to 
deserialize the bytes given an offset and numBytes; selectedFields would need 
to be handled too. DelimitedInputFormat already has the split logic to deal 
with partial lines. But as Fabian said, we do not know whether the next 
newline character is a record delimiter or part of a string field.
 # Use Jackson's ObjectReader.readValues(InputStream). The difficulties are:
 ## ObjectReader does not know the current read offset, because it buffers 
more bytes than it has parsed. One solution is to use BoundedInputStream, but 
we still need to read the unfinished last line, so we first need to adjust 
splitLength to the correct end position based on the line delimiter and 
escapeChar.
 ## We also need to correctly determine the line separator when we start 
reading. If the first char is a line separator, the character before it may be 
an escape character. We need to deal with these cases carefully.

 

 

> Introduce RowCsvInputFormat to new CSV module
> ---------------------------------------------
>
>                 Key: FLINK-14266
>                 URL: https://issues.apache.org/jira/browse/FLINK-14266
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Connectors / FileSystem
>            Reporter: Jingsong Lee
>            Assignee: Jingsong Lee
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently we have an old CSV format, but it does not provide standard CSV 
> support. We should support the RFC-compliant CSV format for Table/SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
