[ 
https://issues.apache.org/jira/browse/FLINK-20221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235988#comment-17235988
 ] 

Gyula Fora commented on FLINK-20221:
------------------------------------

[~rmetzger] , I would like to open a PR but first I think I need some feedback 
from someone who understands the internals of these input formats more in 
detail. 

Turns out this is not a super straightforward change as the current behaviour 
of `reopen` calls `open` which causes some problems as the open method already 
start reading the stream which would trigger a backwards seek in the reopen in 
many cases. Backwards seek is however not supported on the wrapper inflater 
input streams used for compressed files.

So maybe we have to change reopen so as to not call open (or maybe only call it 
if we checkpointed the state before we started reading the split, not even sure 
if this is possible). 

In any case changing the reopen this way would cause problems in the hierarchy 
as other subclasses like CsvInputFormat rely on the open method being always 
called (even on restore). 

I am not super eager to push some last minute change that will break this 
completely :D  

> DelimitedInputFormat does not restore compressed filesplits correctly leading 
> to dataloss
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-20221
>                 URL: https://issues.apache.org/jira/browse/FLINK-20221
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / FileSystem
>    Affects Versions: 1.10.2, 1.12.0, 1.11.2
>            Reporter: Gyula Fora
>            Assignee: Gyula Fora
>            Priority: Blocker
>             Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> It seems that the delimited input format cannot correctly restore input 
> splits if they belong to compressed files. Basically when a compressed 
> filesplit is restored in the middle, it won't read it anymore leading to 
> dataloss.
> The cause of the problem is that for compressed splits that use an inflater 
> stream, the splitlength is set to the magic number -1 which is ignored in the 
> reopen method and causes the split to go to `end` state immediately.
> The problem and the fix is shown in this commit:
> [https://github.com/gyfora/flink/commit/4adc8ba8d1989fff2db43881c9cb3799848c6e0d]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to