xiaolong3817 created FLINK-39035:
------------------------------------

             Summary: Support Avro fast-read and column pruning in 
AvroDeserializationSchema via configuration
                 Key: FLINK-39035
                 URL: https://issues.apache.org/jira/browse/FLINK-39035
             Project: Flink
          Issue Type: Improvement
          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
    Affects Versions: 2.3.0
            Reporter: xiaolong3817


*Description:*

The Avro format offers several valuable features, including *fast-read* and 
*column pruning*. However, the current {{AvroDeserializationSchema}} does 
not use either of them. {{RegistryAvroDeserializationSchema}} exists, but its 
adoption cost and complexity are too high for many use cases.

I propose supporting these features in the standard 
{{AvroDeserializationSchema}} by introducing a few simple configuration 
parameters.

*Proposed Changes:*

*1. Support Fast Read ({{open.fastread}} / {{avro.fast-read.enabled}})*

Fast read can be enabled via a single configuration parameter (defaulting to 
{{false}}). When it is enabled, Avro internally caches a series of parsers. 
We only need to invoke {{setSchema}} for the specific class to activate this 
optimization.

*2. Support Writer Schema for Column Pruning ({{avro.writer.schema.string}})*

Avro inherently supports column pruning (projection). This is particularly 
useful when the upstream table contains a large number of columns but the 
Flink SQL table requires only a subset of them.

Currently, when SQL users write a {{CREATE TABLE}} statement, the schema they 
define corresponds to the Avro *reader schema*.

We should support a parameter that explicitly specifies the *writer schema*. 
Additionally, we can provide an API for upstream connectors (e.g., Kafka): 
through projection pushdown, the connector can pass the actual schema to the 
deserializer. The Avro format can then skip decoding fields that are not 
needed, which should improve performance noticeably when only a few columns 
are read.
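
To illustrate why the decoder needs the writer schema in order to skip fields, 
here is a minimal, self-contained sketch. It is a toy model, not Avro's API or 
wire format: the class {{ProjectionSketch}}, its {{Field}} type, and the 
length-prefixed encoding are all hypothetical. What it shares with Avro is 
that the encoded bytes carry no field names or tags, so the decoder has to 
walk the writer schema to know how many bytes each unwanted field occupies.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of writer-schema-driven projection. Hypothetical; NOT Avro's API or wire format. */
final class ProjectionSketch {

    enum Type { INT, STRING }

    /** One field of the (hypothetical) writer schema, listed in wire order. */
    static final class Field {
        final String name;
        final Type type;

        Field(String name, Type type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Writes the values in writer-schema order, without any field names or tags. */
    static byte[] encode(List<Field> writerSchema, Map<String, Object> row) {
        ByteBuffer buf = ByteBuffer.allocate(1024); // fixed capacity is enough for this toy
        for (Field f : writerSchema) {
            Object value = row.get(f.name);
            if (f.type == Type.INT) {
                buf.putInt((Integer) value);
            } else {
                byte[] utf8 = ((String) value).getBytes(StandardCharsets.UTF_8);
                buf.putInt(utf8.length).put(utf8); // length prefix, then the bytes
            }
        }
        return Arrays.copyOf(buf.array(), buf.position());
    }

    /** Walks the writer schema; materializes projected fields and skips the rest. */
    static Map<String, Object> decode(List<Field> writerSchema, Set<String> projected, byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        Map<String, Object> out = new LinkedHashMap<>();
        for (Field f : writerSchema) {
            boolean keep = projected.contains(f.name);
            if (f.type == Type.INT) {
                if (keep) {
                    out.put(f.name, buf.getInt());
                } else {
                    buf.position(buf.position() + 4); // skip a fixed-width int
                }
            } else {
                int length = buf.getInt();
                if (keep) {
                    byte[] utf8 = new byte[length];
                    buf.get(utf8);
                    out.put(f.name, new String(utf8, StandardCharsets.UTF_8));
                } else {
                    buf.position(buf.position() + length); // skip without building a String
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> writerSchema = Arrays.asList(
                new Field("id", Type.INT),
                new Field("payload", Type.STRING),
                new Field("name", Type.STRING));

        Map<String, Object> row = new HashMap<>();
        row.put("id", 7);
        row.put("payload", "large blob the query never reads");
        row.put("name", "flink");

        byte[] data = encode(writerSchema, row);
        Set<String> projected = new HashSet<>(Arrays.asList("id", "name"));
        System.out.println(decode(writerSchema, projected, data)); // prints {id=7, name=flink}
    }
}
```

In Avro itself, the analogous step is constructing the datum reader with both 
schemas, for example {{new GenericDatumReader<>(writerSchema, readerSchema)}}. 
How the proposed {{avro.writer.schema.string}} option would be wired into 
{{AvroDeserializationSchema}} is left open here.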



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
