[ https://issues.apache.org/jira/browse/BEAM-8818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987966#comment-16987966 ]
Ismaël Mejía commented on BEAM-8818:
------------------------------------
Beam should support only the compression formats that Parquet itself supports.
In that case we should support GZIP out of the box. tar.gz is probably not a
good idea because the most common use case is a directory full of parquet files
already compressed at the block level.
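
To illustrate why a tar.gz wrapper differs from Parquet's own compression,
here is a minimal stdlib-only sketch. It relies on one fact about the format:
a Parquet file starts and ends with the 4-byte magic `PAR1`. The payload below
is a stand-in rather than a real Parquet file, and the helper names and the
member name `part-00000.parquet` are hypothetical.

```python
import io
import tarfile

PARQUET_MAGIC = b"PAR1"  # a Parquet file starts and ends with these 4 bytes


def has_parquet_magic(data: bytes) -> bool:
    """Return True if the bytes carry the Parquet magic at both ends."""
    return (
        len(data) >= 8
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


def make_tar_gz(name: str, payload: bytes) -> bytes:
    """Pack a single member into an in-memory tar.gz archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def extract_member(archive: bytes, name: str) -> bytes:
    """Unpack one member from an in-memory tar.gz archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        return tar.extractfile(name).read()


# Stand-in payload: only the magic bytes are real, the body is filler.
fake_parquet = PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC
archive = make_tar_gz("part-00000.parquet", fake_parquet)

# The archive begins with the gzip header, so a Parquet reader cannot use it.
archive_is_parquet = has_parquet_magic(archive)

# Only after unpacking does the member look like a Parquet file again.
member_is_parquet = has_parquet_magic(
    extract_member(archive, "part-00000.parquet")
)
```

So a reader would have to unpack the archive first, whereas block-level
compression leaves the file readable as Parquet without an extra step.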
I tested this by writing a gzip-compressed Parquet file with Spark and then
reading it in Beam with parquetio; it worked out of the box.
That surprised me given the line you mention. The parameter is probably
ignored because the read code infers the compression codec automatically from
Parquet's file metadata. We should probably update that line to be simply
`CompressionTypes.AUTO`, as it is in avroio.py.
Or are you seeing a different, explicit exception, [~ethansiew]?
> beam.io.parquetio.ReadAllFromParquet from compressed tar.gz files
> -----------------------------------------------------------------
>
> Key: BEAM-8818
> URL: https://issues.apache.org/jira/browse/BEAM-8818
> Project: Beam
> Issue Type: Wish
> Components: io-py-parquet
> Affects Versions: 2.16.0
> Reporter: Ethan Siew
> Priority: Major
>
> Hi
> Is it possible to read from tar.gz compressed parquet files? Is there a
> technical limitation that prevents this? It seems to be hardcoded
> here to read only UNCOMPRESSED parquet files:
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/parquetio.py#L227]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)