[ 
https://issues.apache.org/jira/browse/BEAM-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224834#comment-15224834
 ] 

Eugene Kirpichov commented on BEAM-167:
---------------------------------------

Here's a test and a patch:
https://gist.github.com/jkff/d8d984a33a41ec607328cee8e418c174
(I haven't gone through the contribution guide steps yet and will do so as 
soon as I can. In the meantime, anyone who has completed them should feel free 
to use this directly.)

> TextIO can't read concatenated gzip files
> -----------------------------------------
>
>                 Key: BEAM-167
>                 URL: https://issues.apache.org/jira/browse/BEAM-167
>             Project: Beam
>          Issue Type: Bug
>            Reporter: Eugene Kirpichov
>
> $ cat <<END > header.csv
> a,b,c
> END
> $ cat <<END > body.csv
> 1,2,3
> 4,5,6
> 7,8,9
> END
> $ gzip -c header.csv > file.gz
> $ gzip -c body.csv >> file.gz
> The file is well-formed:
> $ gzip -dc file.gz
> a,b,c
> 1,2,3
> 4,5,6
> 7,8,9
> However, TextIO.Read.from("/path/to/file.gz") reads only "a,b,c". This is 
> reproducible even when the file is on local disk and with the 
> DirectPipelineRunner.
> The bug is in CompressedSource. It uses GzipCompressorInputStream, which by 
> default reads only the first gzip stream in the file, but has an option to 
> read all of them. Previously (in Dataflow SDK 1.4.0) we used GZIPInputStream, 
> which reads all streams.
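
For readers who want to see the multi-stream behavior without setting up a pipeline, here is a minimal sketch using only the JDK. It builds the same two-member gzip file in memory and reads it back with java.util.zip.GZIPInputStream, the class the description says reads all streams. The class and method names (ConcatenatedGzipDemo, gzip, concat, readLines) are made up for illustration and are not taken from the linked patch. The Commons Compress option mentioned above is not exercised here; to my knowledge it is the two-argument constructor GzipCompressorInputStream(InputStream, boolean decompressConcatenated).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Beam or of the linked patch.
public class ConcatenatedGzipDemo {

    /** Compresses the text into one complete gzip member (header, data, trailer). */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Appends b to a, mirroring "gzip -c body.csv >> file.gz". */
    public static byte[] concat(byte[] a, byte[] b) {
        byte[] result = new byte[a.length + b.length];
        System.arraycopy(a, 0, result, 0, a.length);
        System.arraycopy(b, 0, result, a.length, b.length);
        return result;
    }

    /** Decompresses with the JDK's GZIPInputStream and returns the lines it yields. */
    public static List<String> readLines(byte[] gzipped) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] file = concat(gzip("a,b,c\n"), gzip("1,2,3\n4,5,6\n7,8,9\n"));
        // The header line and all three body lines should be listed.
        System.out.println(readLines(file));
    }
}
```

A reader that stops after the first gzip member would return only the header line here, which matches the symptom reported for TextIO.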



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
