[ https://issues.apache.org/jira/browse/BEAM-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225122#comment-15225122 ]
Daniel Halperin commented on BEAM-167:
--------------------------------------
Thanks Eugene!
> TextIO can't read concatenated gzip files
> -----------------------------------------
>
> Key: BEAM-167
> URL: https://issues.apache.org/jira/browse/BEAM-167
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Eugene Kirpichov
> Assignee: Luke Cwik
>
> $ cat <<END > header.csv
> a,b,c
> END
> $ cat <<END > body.csv
> 1,2,3
> 4,5,6
> 7,8,9
> END
> $ gzip -c header.csv > file.gz
> $ gzip -c body.csv >> file.gz
> The file is well-formed:
> $ gzip -dc file.gz
> a,b,c
> 1,2,3
> 4,5,6
> 7,8,9
> However, TextIO.Read.from("/path/to/file.gz") reads only "a,b,c". This is
> reproducible even when the file is on local disk and the pipeline runs with
> the DirectPipelineRunner.
> The bug is in CompressedSource. It uses GzipCompressorInputStream, which by
> default reads only the first gzip stream in the file, but has an option to
> read all of them. Previously (in Dataflow SDK 1.4.0) we used GZIPInputStream,
> which reads all streams.
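
The GZIPInputStream behaviour described in the issue can be checked with the JDK alone. The sketch below is not Beam code: the class and method names (ConcatenatedGzipDemo, gzip, concat, gunzipAll) are made up for illustration. It builds the same two-member file in memory and decompresses it with java.util.zip.GZIPInputStream. For Commons Compress, the option the description mentions is, as far as I know, the two-argument constructor GzipCompressorInputStream(InputStream, boolean decompressConcatenated); it is left out here so the example runs without extra dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; none of these names come from Beam.
final class ConcatenatedGzipDemo {

    // Compress one string into a complete, standalone gzip member.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // In-memory equivalent of:
    //   gzip -c header.csv > file.gz; gzip -c body.csv >> file.gz
    static byte[] concat(byte[] first, byte[] second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(first, 0, first.length);
        out.write(second, 0, second.length);
        return out.toByteArray();
    }

    // Decompress with java.util.zip.GZIPInputStream, which continues
    // into the following gzip members instead of stopping after the first.
    static String gunzipAll(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] file = concat(gzip("a,b,c\n"), gzip("1,2,3\n4,5,6\n7,8,9\n"));
        // Prints the header line followed by the three body lines.
        System.out.print(gunzipAll(file));
    }
}
```

A reader that stops after the first member would return only "a,b,c" for the same bytes, which matches the symptom reported for TextIO.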