[ 
https://issues.apache.org/jira/browse/FLINK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39035:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support Avro fast-read and column pruning in AvroDeserializationSchema via 
> configuration
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-39035
>                 URL: https://issues.apache.org/jira/browse/FLINK-39035
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 2.3.0
>            Reporter: xiaolong3817
>            Priority: Major
>              Labels: pull-request-available
>
> *Description:*
> The Avro format offers several valuable features, including *fast-read* and 
> {*}column pruning{*}. However, the current {{AvroDeserializationSchema}} does 
> not leverage these capabilities. While {{RegistryAvroDeserializationSchema}} 
> exists, for many use cases its adoption cost and complexity are prohibitively 
> high.
> I propose supporting these features in the standard 
> {{AvroDeserializationSchema}} by introducing a few simple configuration 
> parameters.
> *Proposed Changes:*
> *1. Support Fast Read ({{open.fastread}} / {{avro.fast-read.enabled}})*
> The fast-read capability can be enabled via a simple configuration parameter 
> (defaulting to {{false}}). When this feature is enabled, Avro internally 
> caches a series of parsers; we only need to invoke {{setSchema}} for the 
> specific class to activate this optimization.
> *2. Support Writer Schema for Column Pruning ({{avro.writer.schemaString}})*
> Avro inherently supports column pruning (projection). This is particularly 
> useful when the upstream table contains a large number of columns but the 
> Flink SQL table only requires a subset of them.
> Currently, when SQL users issue a {{CREATE TABLE}} statement, the defined 
> schema corresponds to the Avro {*}Reader Schema{*}.
> We should support a parameter to explicitly specify the {*}Writer Schema{*}. 
> Additionally, we can provide an API so that upstream connectors (e.g., Kafka) 
> can pass the actual writer schema to the deserializer via projection 
> pushdown. This allows the Avro format to skip reading unnecessary data 
> fields entirely, resulting in significant performance improvements.
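For context, a minimal sketch of the plain Avro mechanics this proposal builds on (Avro API only, independent of Flink; the schemas and field names here are illustrative): a {{GenericDatumReader}} constructed with both a writer schema and a narrower reader schema resolves the two and skips the unread fields during decoding, and Avro's cached fast reader can be toggled via {{GenericData#setFastReaderEnabled}} (assuming a recent Avro version, 1.11+).

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroPruningSketch {
    public static void main(String[] args) throws Exception {
        // Writer schema: the full upstream record (illustrative).
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"payload\",\"type\":\"string\"}]}");
        // Reader schema: only the subset the Flink SQL table needs.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"}]}");

        // Serialize one record with the full writer schema.
        GenericRecord record = new GenericData.Record(writerSchema);
        record.put("id", 42L);
        record.put("payload", "large column we do not need");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
        encoder.flush();

        // Enable Avro's cached fast reader (assumed Avro 1.11+).
        GenericData.get().setFastReaderEnabled(true);

        // Deserialize with writer + reader schema: "payload" is skipped,
        // never materialized into the output record.
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(writerSchema, readerSchema);
        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord pruned = reader.read(null, decoder);
        System.out.println(pruned); // only the "id" field survives
    }
}
```

The proposal essentially amounts to wiring these two knobs through {{AvroDeserializationSchema}}'s configuration instead of requiring users to drop down to the Avro API or to {{RegistryAvroDeserializationSchema}}.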
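On the SQL side, the proposed options could surface in a table definition along these lines (option names taken from this issue; the exact keys and the writer-schema JSON below are illustrative and still to be settled in the PR):

```sql
CREATE TABLE orders (
  id BIGINT,
  amount DOUBLE
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'format' = 'avro',
  -- proposed: enable Avro's cached fast reader (default false)
  'avro.fast-read.enabled' = 'true',
  -- proposed: full upstream (writer) schema; the columns above act as
  -- the reader schema, so extra upstream fields are pruned on read
  'avro.writer.schemaString' = '{"type":"record","name":"Order","fields":[{"name":"id","type":"long"},{"name":"amount","type":"double"},{"name":"payload","type":"string"}]}'
);
```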



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
