[
https://issues.apache.org/jira/browse/BEAM-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kenneth Knowles updated BEAM-8168:
----------------------------------
Status: Open (was: Triage Needed)
> Python GCSFileSystem failing with gzip content encoding
> -------------------------------------------------------
>
> Key: BEAM-8168
> URL: https://issues.apache.org/jira/browse/BEAM-8168
> Project: Beam
> Issue Type: Bug
> Components: io-py-gcp
> Affects Versions: 2.15.0
> Reporter: Daniel Ecer
> Priority: Major
>
> Google Cloud Storage supports gzip content encoding (objects stored with Content-Encoding: gzip).
>
> Apache Beam (Python) works correctly with .gz files that have no content
> encoding set.
> However, it fails to handle .gz files that have gzip content encoding applied,
> e.g. (the following can be run in a Jupyter notebook):
> {code:python}
> file_url_1 = 'gs://some-bucket/test1.gz'
> file_url_2 = 'gs://some-bucket/test2.gz'
> !echo 'my content' > /tmp/test
> # file 1 without content encoding
> !cat /tmp/test | gzip | gsutil cp - "{file_url_1}"
> # file 2 with content encoding
> !gsutil cp -Z /tmp/test "{file_url_2}"
> !gsutil cat "{file_url_1}" | zcat -
> # output: my content
> !gsutil cat "{file_url_2}" | zcat -
> # output: my content
> import apache_beam as beam
> from apache_beam.io.filesystem import CompressionTypes
> from apache_beam.io.filesystems import FileSystems
> print(beam.__version__)
> # output: 2.15.0
> with FileSystems.open(file_url_1,
>         compression_type=CompressionTypes.UNCOMPRESSED) as fp:
>     print(fp.read(10))
> # output: b'\x1f\x8b\x08\x00\x10\xd6r]\x00\x03'
> with FileSystems.open(file_url_1) as fp:
>     print(fp.read(10))
> # output: b'my content'
> with FileSystems.open(file_url_2,
>         compression_type=CompressionTypes.UNCOMPRESSED) as fp:
>     print(fp.read(10))
> # output: b'my content'
> # (here I would expect the raw gzipped bytes)
> with FileSystems.open(file_url_2) as fp:
>     print(fp.read(10))
> # exception: FailedToDecompressContent: Content purported to be compressed
> # with gzip but failed to decompress.
> {code}
>
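For illustration only, the difference between the two objects can be reproduced locally without GCS. The sketch below is not Beam's code and not a proposed fix. `read_maybe_gzipped` is a hypothetical helper that decides whether to decompress by checking for the gzip magic bytes (0x1f 0x8b, the same prefix seen in the first output of the report) rather than by trusting the .gz extension. It assumes the whole payload is already in memory as bytes.

```python
import gzip

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def read_maybe_gzipped(raw: bytes) -> bytes:
    """Hypothetical helper: gunzip only if the payload looks like gzip.

    Payloads that do not start with the gzip magic bytes are returned
    unchanged, so already-decompressed content is not decompressed twice.
    """
    if raw.startswith(GZIP_MAGIC):
        return gzip.decompress(raw)
    return raw


# Stand-in for file 1: gzipped bytes delivered as-is.
payload_still_gzipped = gzip.compress(b"my content\n")

# Stand-in for file 2: bytes that arrive already decompressed,
# even though the object name ends in .gz.
payload_already_plain = b"my content\n"

assert read_maybe_gzipped(payload_still_gzipped) == b"my content\n"
assert read_maybe_gzipped(payload_already_plain) == b"my content\n"
```

Content sniffing of this kind only covers the case where the bytes arrive intact; it says nothing about where the FailedToDecompressContent exception in the report is raised.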
--
This message was sent by Atlassian Jira
(v8.3.4#803005)