Eugene Kirpichov created BEAM-167:
-------------------------------------

             Summary: TextIO can't read concatenated gzip files
                 Key: BEAM-167
                 URL: https://issues.apache.org/jira/browse/BEAM-167
             Project: Beam
          Issue Type: Bug
            Reporter: Eugene Kirpichov


$ cat <<END > header.csv
a,b,c
END
$ cat <<END > body.csv
1,2,3
4,5,6
7,8,9
END
$ gzip -c header.csv > file.gz
$ gzip -c body.csv >> file.gz

The file is well-formed:
$ gzip -dc file.gz
a,b,c
1,2,3
4,5,6
7,8,9

However, TextIO.Read.from("/path/to/file.gz") reads only the first line, 
"a,b,c". This is reproducible even when the file is on local disk and the 
pipeline runs with the DirectPipelineRunner.

The bug is in CompressedSource. It uses GzipCompressorInputStream, which by 
default reads only the first gzip stream in the file but has an option to read 
all of them. Previously (in Dataflow SDK 1.4.0) we used GZIPInputStream, which 
reads all streams.
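
For reference, here is a minimal standalone sketch of the expected behavior, 
using only java.util.zip from the JDK. It builds the same two-member file in 
memory and reads it back through GZIPInputStream. The class and method names 
(ConcatenatedGzipDemo, gzip, concat, readAll) are made up for illustration and 
are not part of the SDK. For GzipCompressorInputStream, the option mentioned 
above is, as far as I can tell, the two-argument constructor 
GzipCompressorInputStream(in, true) in Commons Compress; that call is not shown 
here because it needs the extra dependency.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the Dataflow/Beam SDK.
public class ConcatenatedGzipDemo {

    // Compress one string into a standalone gzip member (like `gzip -c`).
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // Append one gzip member to another (like `gzip -c b >> file.gz`).
    public static byte[] concat(byte[] first, byte[] second) {
        byte[] result = new byte[first.length + second.length];
        System.arraycopy(first, 0, result, 0, first.length);
        System.arraycopy(second, 0, result, first.length, second.length);
        return result;
    }

    // Decompress with java.util.zip.GZIPInputStream until end of input.
    public static String readAll(byte[] gzipped) throws IOException {
        try (InputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] file = concat(gzip("a,b,c\n"), gzip("1,2,3\n4,5,6\n7,8,9\n"));
        // Should print the header line followed by the three body lines.
        System.out.print(readAll(file));
    }
}
```

With an in-memory input like this, GZIPInputStream continues past the end of 
the first member, which matches what `gzip -dc` does above and what TextIO 
should do.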



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
