[ 
https://issues.apache.org/jira/browse/FLINK-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236994#comment-17236994
 ] 

Gyula Fora commented on FLINK-20276:
------------------------------------

Some compressed formats like Bzip2 are splittable at block boundaries (when 
using certain codecs like Hadoop's bzip2 codec), but this seems fairly 
tricky to integrate with the current FileInputFormat. The problem is that the 
InputFormat itself tracks the number of bytes read instead of getting the 
actual offsets of the compressed file splits.
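
To illustrate the offset problem, here is a minimal sketch using only the JDK. 
It is not Flink code: the class and method names are hypothetical, and gzip 
stands in for bzip2 because the JDK ships no bzip2 codec (gzip is not 
splittable, so this only shows the mismatch between the two counters). It 
compares the decompressed bytes a reader sees with the bytes actually consumed 
from the compressed stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration, not part of Flink.
class OffsetMismatchSketch {

    /** Counts the bytes pulled from the underlying (compressed) stream. */
    static final class CountingInputStream extends FilterInputStream {
        long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /** Compresses the given bytes with gzip. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        }
        return bos.toByteArray();
    }

    /** Returns {decompressed bytes read, compressed bytes consumed}. */
    static long[] readAll(byte[] compressed) throws IOException {
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(compressed));
        long decompressed = 0;
        try (GZIPInputStream in = new GZIPInputStream(counting)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                decompressed += n;
            }
        }
        return new long[] {decompressed, counting.count};
    }
}
```

A reader that only counts the first number cannot tell where it is in the 
compressed file, which is what split boundaries are expressed in.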

I wonder whether this is worth thinking about at this point (for the new 
File Source) or whether we can simply deal with it later. What do you think, 
[~sewen]?

> Transparent DeCompression of streams missing on new File Source
> ---------------------------------------------------------------
>
>                 Key: FLINK-20276
>                 URL: https://issues.apache.org/jira/browse/FLINK-20276
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / FileSystem
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Critical
>             Fix For: 1.12.0
>
>
> The existing {{FileInputFormat}} applies decompression (gzip, xz, ...) 
> automatically on the file input stream, based on the file extension.
> We need to add similar functionality for the {{StreamRecordFormat}} of the 
> new FileSource to be on par with this functionality.
> This can be easily applied in the {{StreamFormatAdapter}} when opening the 
> file stream.
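
A rough sketch of extension-based decompression as described in the issue, 
using only JDK codecs. This is not the actual Flink implementation: the class 
and method names are hypothetical, and the extension-to-codec mapping shown 
here is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical illustration, not part of Flink.
class ExtensionDecompressionSketch {

    /** Returns the lower-cased extension after the last dot, or "" if none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Wraps the raw file stream in a decompressing stream chosen by file
     * extension. Unknown extensions return the raw stream unchanged.
     */
    static InputStream wrap(String fileName, InputStream raw) throws IOException {
        switch (extensionOf(fileName)) {
            case "gz":
            case "gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                return new InflaterInputStream(raw);
            default:
                return raw;
        }
    }
}
```

The caller would apply `wrap` once when opening the file stream and then read 
records from the returned stream as if the file were uncompressed.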



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
