[
https://issues.apache.org/jira/browse/BEAM-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340207#comment-17340207
]
Ismaël Mejía commented on BEAM-7925:
------------------------------------
With the support for passing a Hadoop Configuration added in BEAM-11913,
ParquetIO does not need an explicit API for either column projection or filter
predicates, since users can achieve both manually through the native Parquet
mechanisms:
* For column projection: AvroReadSupport.setRequestedProjection(conf,
projectionSchema)
* For filter predicates: ParquetInputFormat.setFilterPredicate(conf,
filterPredicate)
This gives users full flexibility and reduces the maintenance burden on the
Beam side. We rarely implement duplicate features just for ease of use when
users can do the same through the IO's native API. Of course, this will be done
automatically if users rely on Beam SQL and the ParquetTable/Filter
implementations in BEAM-7929.
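For illustration, the two native Parquet calls mentioned above could be
combined roughly as follows. This is only a sketch under assumptions, not a
reference implementation: the class name, the record name "User", the columns
"name" and "age" and the predicate are hypothetical, and it assumes a Beam
version in which ParquetIO.Read exposes withConfiguration taking a Hadoop
Configuration (the BEAM-11913 change). Passing the projection schema to
ParquetIO.read is also an assumption made to keep the example short.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class ParquetProjectionAndFilterSketch {

  // Hypothetical projection: only these two columns are requested.
  static Schema projectionSchema() {
    return SchemaBuilder.record("User").fields()
        .requiredString("name")
        .requiredInt("age")
        .endRecord();
  }

  // Builds a Hadoop Configuration carrying both the projection and the filter.
  static Configuration buildConf(Schema projectionSchema) {
    Configuration conf = new Configuration();
    // Column projection through the native parquet-avro API.
    AvroReadSupport.setRequestedProjection(conf, projectionSchema);
    // Filter predicate through the native Parquet API (hypothetical: age > 18).
    FilterPredicate olderThan18 = FilterApi.gt(FilterApi.intColumn("age"), 18);
    ParquetInputFormat.setFilterPredicate(conf, olderThan18);
    return conf;
  }

  // Assumes ParquetIO.Read#withConfiguration(Configuration) is available.
  static PCollection<GenericRecord> read(Pipeline pipeline, String filePattern) {
    Schema projection = projectionSchema();
    return pipeline.apply(
        ParquetIO.read(projection)
            .from(filePattern)
            .withConfiguration(buildConf(projection)));
  }

  public static void main(String[] args) {
    Configuration conf = buildConf(projectionSchema());
    // Prints the projection schema that was stored in the configuration.
    System.out.println(conf.get(AvroReadSupport.AVRO_REQUESTED_PROJECTION));
  }
}
```

Since the configuration is built outside of Beam, users keep control over
which Parquet options are set, and ParquetIO only has to forward it to the
reader.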
> ParquetIO supports neither column projection nor filter predicate
> -----------------------------------------------------------------
>
> Key: BEAM-7925
> URL: https://issues.apache.org/jira/browse/BEAM-7925
> Project: Beam
> Issue Type: Improvement
> Components: io-java-parquet
> Affects Versions: 2.14.0
> Reporter: Neville Li
> Priority: P3
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> Current {{ParquetIO}} supports neither column projection nor filter predicate
> which defeats the performance motivation of using Parquet in the first place.
> That's why we have our own implementation of
> [ParquetIO|https://github.com/spotify/scio/tree/master/scio-parquet/src] in
> Scio.
> Reading Parquet as Avro with column projection has some complications,
> namely, the resulting Avro records may be incomplete and will not survive
> ser/de. A workaround may be to provide a {{TypedRead}} interface that takes a
> {{Function<A, B>}} that maps invalid Avro {{A}} into a user-defined type {{B}}.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)