[jira] [Updated] (BEAM-11913) Add support for Hadoop configuration on ParquetIO

Jira Tue, 02 Mar 2021 07:49:04 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-11913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ismaël Mejía updated BEAM-11913:
--------------------------------
    Description: 
This is a common request from users. We did not do it in the past because we 
tried to avoid Hadoop objects in ParquetIO's public API. However, there are 
valid reasons to do it:

1. Many Parquet features are configured through public helper methods on 
Parquet classes that store settings in Hadoop's Configuration object, e.g. 
column projection via 
`AvroReadSupport.setRequestedProjection(conf, projectionSchema);` or predicate 
filters via 
`ParquetInputFormat.setFilterPredicate(sc.hadoopConfiguration(), filterPredicate);`. 
Giving access to the Configuration would allow power users to use these 
advanced features without any maintenance on the IO side.

2. The main reason to avoid the Hadoop Configuration object was to align with 
future Parquet APIs that do not require Hadoop (see PARQUET-1126 for details), 
but this does not seem likely to happen soon.

  was:
We have discussed this issue in the past and we tried to avoid Hadoop objects 
in Parquet public API however there are two valid reasons for this:

1. Many functionalities of Parquet are configurable via public helper methods 
on Parquet that prepare data inside of Hadoop's Configuration object, e.g. 
Column Projection via 
`AvroReadSupport.setRequestedProjection(conf, projectionSchema);` or Predicate 
Filters via 
`ParquetInputFormat.setFilterPredicate(sc.hadoopConfiguration(), filterPredicate);`. 
Giving access to those would allow power users to do advanced stuff without any 
maintenance on the IO side.



2. The main reason to avoid the Hadoop Configuration object was to align with 
future non Hadoop required APIs on Parquet see PARQUET-1126 for details but 
this does not seem that will happen soon.


> Add support for Hadoop configuration on ParquetIO
> -------------------------------------------------
>
>                 Key: BEAM-11913
>                 URL: https://issues.apache.org/jira/browse/BEAM-11913
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-parquet
>            Reporter: Ismaël Mejía
>            Assignee: Ismaël Mejía
>            Priority: P2
>
> This is a common request from users. We did not do it in the past because 
> we tried to avoid Hadoop objects in ParquetIO's public API. However, there 
> are valid reasons to do it:
> 1. Many Parquet features are configured through public helper methods on 
> Parquet classes that store settings in Hadoop's Configuration object, e.g. 
> column projection via 
> `AvroReadSupport.setRequestedProjection(conf, projectionSchema);` or 
> predicate filters via 
> `ParquetInputFormat.setFilterPredicate(sc.hadoopConfiguration(), filterPredicate);`. 
> Giving access to the Configuration would allow power users to use these 
> advanced features without any maintenance on the IO side.
> 2. The main reason to avoid the Hadoop Configuration object was to align with 
> future Parquet APIs that do not require Hadoop (see PARQUET-1126 for 
> details), but this does not seem likely to happen soon.
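
For illustration, the two helper calls named in point 1 can be combined into 
one Hadoop Configuration roughly as follows. This is only a sketch: it assumes 
parquet-avro, parquet-hadoop and hadoop-common are on the classpath, and the 
class name `ParquetConfExample`, the `User` record, the `id` field and the 
`id > 100` predicate are hypothetical names chosen for the example. How 
ParquetIO would accept such a Configuration is not specified in this issue, so 
the sketch stops at building the object.

```java
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class ParquetConfExample {

  /** Builds a Configuration carrying a column projection and a predicate filter. */
  static Configuration buildConf() {
    // false = do not load default resources; the sketch only needs a key/value holder.
    Configuration conf = new Configuration(false);

    // Hypothetical projection: read only the "id" column of a "User" record.
    Schema projectionSchema =
        SchemaBuilder.record("User").fields().requiredLong("id").endRecord();
    AvroReadSupport.setRequestedProjection(conf, projectionSchema);

    // Hypothetical predicate: keep only rows where id > 100.
    FilterPredicate filterPredicate =
        FilterApi.gt(FilterApi.longColumn("id"), 100L);
    ParquetInputFormat.setFilterPredicate(conf, filterPredicate);

    return conf;
  }

  public static void main(String[] args) {
    // List the keys the helpers wrote, showing that both settings
    // live entirely inside the Configuration object.
    for (Map.Entry<String, String> entry : buildConf()) {
      System.out.println(entry.getKey());
    }
  }
}
```

Both helpers only write entries into the Configuration, which is why exposing 
that object would be enough to give users access to these features.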



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
