[ 
https://issues.apache.org/jira/browse/BEAM-8818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987966#comment-16987966
 ] 

Ismaël Mejía commented on BEAM-8818:
------------------------------------

Beam should support only the compression formats that Parquet itself supports, 
so GZIP should work out of the box. Supporting tar.gz is probably not a good 
idea, because the most common use case is a directory full of Parquet files 
that are already compressed at the block level.
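
To illustrate why tar.gz is a different problem from block-level compression, 
here is a minimal sketch using only the Python standard library. It relies on 
the fact that a Parquet file begins and ends with the magic bytes `PAR1`. The 
helper names (`looks_like_parquet`, `make_tar_gz`, `extract_member`) and the 
payload are hypothetical stand-ins, not Beam or Parquet APIs, and the check is 
only a heuristic.

```python
import gzip
import io
import tarfile

PARQUET_MAGIC = b"PAR1"    # Parquet files begin and end with these 4 bytes
GZIP_MAGIC = b"\x1f\x8b"   # every gzip stream starts with these 2 bytes


def looks_like_parquet(data: bytes) -> bool:
    """Heuristic check of the Parquet framing only; it does not parse the footer."""
    return (
        len(data) >= 8
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


def make_tar_gz(name: str, payload: bytes) -> bytes:
    """Wrap a single in-memory file in a gzip-compressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def extract_member(archive: bytes, name: str) -> bytes:
    """Read one member back out of a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        return tar.extractfile(name).read()


# Stand-in bytes that only carry the Parquet framing; not a real Parquet file.
fake_parquet = PARQUET_MAGIC + b"...column chunks and footer..." + PARQUET_MAGIC
archive = make_tar_gz("part-00000.parquet", fake_parquet)

print(looks_like_parquet(fake_parquet))              # the bare file has the framing
print(archive[:2] == GZIP_MAGIC)                     # the archive is a gzip stream
print(looks_like_parquet(gzip.decompress(archive)))  # gunzip yields a tar, not Parquet
print(looks_like_parquet(extract_member(archive, "part-00000.parquet")))
```

In this sketch, gunzipping alone is not enough: the result is a tar archive 
whose members still have to be unpacked before a Parquet reader can find the 
footer. That unpacking step would have to happen before the files reach the 
Parquet read code.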

I ran a test: I created a Parquet file with gzip compression using Spark and 
then read it in Beam with the Parquet IO, and it worked out of the box.

That it worked surprised me, given the line you mention. That parameter is 
probably ignored because the read code infers the compression codec 
automatically from the Parquet file metadata. We should probably change that 
line to simply `CompressionTypes.AUTO`, as it is in avroio.py.



Or are you seeing a different, explicit exception, [~ethansiew]?

> beam.io.parquetio.ReadAllFromParquet from compressed tar.gz files
> -----------------------------------------------------------------
>
>                 Key: BEAM-8818
>                 URL: https://issues.apache.org/jira/browse/BEAM-8818
>             Project: Beam
>          Issue Type: Wish
>          Components: io-py-parquet
>    Affects Versions: 2.16.0
>            Reporter: Ethan Siew
>            Priority: Major
>
> Hi
> Is it possible to read from tar.gz compressed parquet files? Is there a 
> technical limitation to allow for this to happen? It seems to be hardcoded 
> here to read only UNCOMPRESSED parquet files: 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/parquetio.py#L227]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
