[ 
https://issues.apache.org/jira/browse/FLINK-20295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238616#comment-17238616
 ] 

Stephan Ewen edited comment on FLINK-20295 at 11/25/20, 10:46 AM:
------------------------------------------------------------------

StreamRecordFormats can be splittable. I think the biggest problem is how to do 
the line parsing in a way that supports charsets properly. -Even for UTF-8, 
just searching byte-wise for a '\n' character leads to wrong results due to 
multi-byte code points.- The current DelimitedInputFormat cannot handle various 
cases.
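
The charset concern can be illustrated with a small sketch. It is not Flink code: the class and method names below are hypothetical, and the charset-aware scan assumes a charset with a fixed code-unit width equal to the encoded length of '\n' (UTF-8, UTF-16BE/LE, UTF-32BE/LE). In UTF-8 the byte 0x0A never occurs inside a multi-byte sequence, but in UTF-16 it can be one half of an unrelated character such as U+010A.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of Flink.
class DelimiterScan {

    // Charset-unaware scan: counts every byte equal to 0x0A.
    static int countNewlineBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == 0x0A) {
                count++;
            }
        }
        return count;
    }

    // Charset-aware scan: encodes the newline in the given charset and
    // compares only at offsets that are multiples of its encoded length.
    // Assumption: fixed code-unit width (UTF-8, UTF-16BE/LE, UTF-32BE/LE);
    // charsets that emit a byte order mark are not handled.
    static int countEncodedNewlines(byte[] data, Charset charset) {
        byte[] delim = "\n".getBytes(charset);
        int count = 0;
        for (int i = 0; i + delim.length <= data.length; i += delim.length) {
            if (matchesAt(data, i, delim)) {
                count++;
            }
        }
        return count;
    }

    // Returns true if delim occurs in data starting at the given offset.
    private static boolean matchesAt(byte[] data, int offset, byte[] delim) {
        for (int j = 0; j < delim.length; j++) {
            if (data[offset + j] != delim[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 'a', U+010A (encoded as 0x01 0x0A in UTF-16BE), a real newline, 'b'
        String text = "a\u010A\nb";
        Charset[] charsets = {StandardCharsets.UTF_8, StandardCharsets.UTF_16BE};
        for (Charset cs : charsets) {
            byte[] bytes = text.getBytes(cs);
            System.out.println(cs + ": byte-wise=" + countNewlineBytes(bytes)
                    + ", charset-aware=" + countEncodedNewlines(bytes, cs));
        }
    }
}
```

For this input the byte-wise scan finds two delimiters under UTF-16BE although the text contains a single newline, while both scans agree under UTF-8. A splittable format would additionally need to align split boundaries to code units, which this sketch does not attempt.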


was (Author: stephanewen):
StreamRecordFormats can be splittable. I think the biggest problem is how to do 
the line parsing in a way that supports charsets properly. Even for UTF-8, just 
searching byte-wise for a '\n' character leads to wrong results due to 
multi-byte code points. The current DelimitedInputFormat cannot handle these 
cases.

> File Source lost data when reading from directories created by 
> FileSystemTableSink with JSON format
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20295
>                 URL: https://issues.apache.org/jira/browse/FLINK-20295
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / FileSystem, Table SQL / Ecosystem
>            Reporter: Yun Gao
>            Assignee: Jingsong Lee
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>         Attachments: compaction.tgz
>
>
> When testing the compaction functionality of the FileSystemTableSink, I found 
> that with the JSON format, the produced directories could not be read 
> correctly by the file source: only part of the records is read.
> The produced directories contain the expected number of records, so the 
> issue appears to be on the source side.
>  
> The issue only exists for JSON format.
> The data is produced by 
> [FileCompactionTest|https://github.com/gaoyunhaii/flink1.12test/blob/main/src/main/java/FileCompactionTest.java]
> and read by 
> [FileCompactionCheckTest|https://github.com/gaoyunhaii/flink1.12test/blob/main/src/main/java/FileCompactionCheckTest.java].
> An example tar file of the produced directories, containing 8000 records, is 
> also attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
