[
https://issues.apache.org/jira/browse/FLINK-14266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949378#comment-16949378
]
Jingsong Lee edited comment on FLINK-14266 at 10/11/19 11:27 AM:
-----------------------------------------------------------------
Thanks [~fhueske], I think there are two choices:
# Extend DelimitedInputFormat and use CsvRowDeserializationSchema to
deserialize the bytes given an offset and numBytes; we also need to deal with
selectedFields. DelimitedInputFormat already has the split logic to deal with
half-lines. But as Fabian said, we do not know whether the next new-line
character is a record delimiter or is contained in a string field.
# Use Jackson's ObjectReader.readValues(InputStream). The difficulties are:
## ObjectReader does not know the current read offset, because it keeps a
buffer that caches more bytes than it has parsed. But we need to stop in the
right place when reading a FileSplit. One solution is to use
BoundedInputStream. But we still need to read the unfinished line, so we need
to modify splitLength first to find the correct end position based on the line
delimiter and escapeChar.
## We also need to correctly determine the line separator when we start
reading a FileSplit whose start offset is in the middle of the file. If the
first char is a line separator, the character before it may be an escape
character. We need to deal with these things carefully.
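
To make the split problem concrete, here is a minimal sketch, not Flink or Jackson code. The class CsvSplitSketch and its methods are hypothetical names, and it assumes '"' as the quote character, a backslash as escapeChar, and '\n' as the line delimiter. It shows why counting new-line bytes is not enough, and how a nominal split end could be moved to the next real record boundary when the scan starts from a known record start.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: none of these names exist in Flink or Jackson.
final class CsvSplitSketch {
    static final byte QUOTE = '"';    // assumed quote character
    static final byte ESCAPE = '\\';  // assumed escapeChar
    static final byte NEWLINE = '\n'; // assumed line delimiter

    /** Naive view: every new-line byte is treated as a record delimiter. */
    static int countNaiveLines(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == NEWLINE) {
                count++;
            }
        }
        return count;
    }

    /**
     * Scans from recordStart, which must be a known record boundary, and
     * returns the exclusive end offset of the first record that ends at or
     * after nominalEnd. New-lines inside quoted fields and bytes following
     * the escape character are not treated as delimiters.
     */
    static int adjustSplitEnd(byte[] data, int recordStart, int nominalEnd) {
        boolean inQuotes = false;
        for (int i = recordStart; i < data.length; i++) {
            byte b = data[i];
            if (b == ESCAPE && i + 1 < data.length) {
                i++; // skip the escaped byte, whatever it is
            } else if (b == QUOTE) {
                inQuotes = !inQuotes;
            } else if (b == NEWLINE && !inQuotes && i + 1 >= nominalEnd) {
                return i + 1;
            }
        }
        return data.length; // last record has no trailing delimiter
    }

    /** Counts records by walking from one real record boundary to the next. */
    static int countRecords(byte[] data) {
        int count = 0;
        int pos = 0;
        while (pos < data.length) {
            pos = adjustSplitEnd(data, pos, pos + 1);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "id,comment\n1,\"line one\nline two\"\n2,plain\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("new-line bytes: " + countNaiveLines(data));
        System.out.println("records:        " + countRecords(data));
        // Nominal split end 20 falls inside the quoted field of the second record.
        System.out.println("adjusted end:   " + adjustSplitEnd(data, 0, 20));
    }
}
```

For this input the naive count sees 4 new-lines while the quote-aware scan finds 3 records. The sketch only works because it scans from a known record start; a FileSplit that begins in the middle of the file has no such starting state, which is exactly the difficulty described above.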
was (Author: lzljs3620320):
Thanks [~fhueske], I think there are two choices:
# Extend DelimitedInputFormat and use CsvRowDeserializationSchema to
deserialize the bytes given an offset and numBytes; we also need to deal with
selectedFields. DelimitedInputFormat already has the split logic to deal with
half-lines. But as Fabian said, we do not know whether the next new-line
character is a record delimiter or is contained in a string field.
# Use Jackson's ObjectReader.readValues(InputStream). The difficulties are:
## ObjectReader does not know the current read offset, because it keeps a
buffer that caches more bytes than it has parsed. One solution is to use
BoundedInputStream. But we still need to read the unfinished line, so we need
to modify splitLength first to find the correct end position based on the line
delimiter and escapeChar.
## We also need to correctly determine the line separator when we start
reading. If the first char is a line separator, the character before it may be
an escape character. We need to deal with these things carefully.
> Introduce RowCsvInputFormat to new CSV module
> ---------------------------------------------
>
> Key: FLINK-14266
> URL: https://issues.apache.org/jira/browse/FLINK-14266
> Project: Flink
> Issue Type: Sub-task
> Components: Connectors / FileSystem
> Reporter: Jingsong Lee
> Assignee: Jingsong Lee
> Priority: Major
> Fix For: 1.10.0
>
>
> We currently have an old CSV format, but it does not provide standard CSV
> support. We should support the RFC-compliant CSV format for table/sql.