[ https://issues.apache.org/jira/browse/BEAM-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Davor Bonaci updated BEAM-167:
------------------------------
Component/s: sdk-java-core (was: sdk-java-extensions)
> TextIO can't read concatenated gzip files
> -----------------------------------------
>
> Key: BEAM-167
> URL: https://issues.apache.org/jira/browse/BEAM-167
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Reporter: Eugene Kirpichov
> Assignee: Luke Cwik
>
> $ cat <<END > header.csv
> a,b,c
> END
> $ cat <<END > body.csv
> 1,2,3
> 4,5,6
> 7,8,9
> END
> $ gzip -c header.csv > file.gz
> $ gzip -c body.csv >> file.gz
> The file is well-formed:
> $ gzip -dc file.gz
> a,b,c
> 1,2,3
> 4,5,6
> 7,8,9
> However, TextIO.Read.from("/path/to/file.gz") reads only "a,b,c" and
> silently drops the remaining lines. This is reproducible even when the file
> is on local disk and the pipeline runs with the DirectPipelineRunner.
> The bug is in CompressedSource. It uses GzipCompressorInputStream, which by
> default reads only the first gzip stream in the file, but has an option to
> read all of them. Previously (in Dataflow SDK 1.4.0) we used GZIPInputStream,
> which reads all streams.
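The expected behaviour can be checked outside of Beam with a minimal JDK-only sketch. It rebuilds the two-member file from the reproduction steps in memory and decompresses it with java.util.zip.GZIPInputStream, which continues past the end of the first member. The class and method names (ConcatenatedGzipDemo, gzipMember, concatenatedFile, readAll) are made up for illustration and are not part of the SDK. The option mentioned in the report most likely corresponds to the Commons Compress constructor GzipCompressorInputStream(InputStream, boolean decompressConcatenated); how CompressedSource should pass it is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; none of these names come from the Beam SDK.
public class ConcatenatedGzipDemo {

    /** Compresses text into one complete gzip member (header, data, trailer). */
    public static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** In-memory equivalent of: gzip -c header.csv > file.gz; gzip -c body.csv >> file.gz */
    public static byte[] concatenatedFile() throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(gzipMember("a,b,c\n"));
        file.write(gzipMember("1,2,3\n4,5,6\n7,8,9\n"));
        return file.toByteArray();
    }

    /** Decompresses until end of input; GZIPInputStream moves on to following members. */
    public static String readAll(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Prints all four CSV lines, matching the output of `gzip -dc file.gz`.
        System.out.print(readAll(concatenatedFile()));
    }
}
```

A regression test for the fix could follow the same shape: write a concatenated file, read it through TextIO, and assert that all four lines come back.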