[
https://issues.apache.org/jira/browse/FLINK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-39035:
-----------------------------------
Labels: pull-request-available (was: )
> Support Avro fast-read and column pruning in AvroDeserializationSchema via
> configuration
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-39035
> URL: https://issues.apache.org/jira/browse/FLINK-39035
> Project: Flink
> Issue Type: Improvement
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Affects Versions: 2.3.0
> Reporter: xiaolong3817
> Priority: Major
> Labels: pull-request-available
>
> *Description:*
> The Avro format offers several valuable features, including *fast-read* and
> *column pruning*. However, the current {{AvroDeserializationSchema}} does
> not leverage these capabilities. While {{RegistryAvroDeserializationSchema}}
> exists, its adoption cost and complexity are exceptionally high for many use
> cases.
> I propose supporting these features in the standard
> {{AvroDeserializationSchema}} by introducing a few simple configuration
> parameters.
> *Proposed Changes:*
> *1. Support Fast Read ({{open.fastread}} / {{avro.fast-read.enabled}})*
> The fast-read capability can be enabled via a simple configuration
> parameter (defaulting to {{false}}). When users enable this feature, Avro
> internally caches a series of parsers; we only need to invoke
> {{setSchema}} for the specific class to activate this optimization.
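> As a sketch of how this could surface to SQL users (the option name
> {{avro.fast-read.enabled}} is the one proposed in this issue and is not
> available in any released Flink version):
> ```sql
> CREATE TABLE orders (
>   order_id BIGINT,
>   amount DOUBLE
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'orders',
>   'format' = 'avro',
>   -- proposed option; would default to 'false'
>   'avro.fast-read.enabled' = 'true'
> );
> ```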
> *2. Support Writer Schema for Column Pruning
> ({{avro.writer.schemaString}})*
> Avro inherently supports column pruning
> (projection). This is particularly useful when the upstream table contains a
> large number of columns, but the Flink SQL table only requires a subset of
> them.
> Currently, when SQL users use the {{CREATE TABLE}} statement, the defined
> schema corresponds to the Avro *Reader Schema*.
> We should support a parameter to explicitly specify the *Writer Schema*.
> Additionally, we can provide an API for upstream connectors (e.g., Kafka):
> through projection pushdown, the connector can pass the actual schema to
> the deserializer. This allows the Avro format to skip reading unnecessary
> data fields entirely, resulting in significant performance improvements.
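> For reference, the pruning described above relies on Avro's standard
> schema resolution: constructing a {{GenericDatumReader}} with both a
> writer and a reader schema makes the decoder skip fields absent from the
> reader schema. A minimal sketch (the {{Order}} record, its field names,
> and the {{serializedBytes}} input are illustrative, not from this issue;
> requires the {{org.apache.avro:avro}} dependency and an enclosing class):
> ```java
> // Writer schema: the full schema the producer serialized with.
> Schema writerSchema = new Schema.Parser().parse(
>     "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
>   + "{\"name\":\"order_id\",\"type\":\"long\"},"
>   + "{\"name\":\"payload\",\"type\":\"string\"},"
>   + "{\"name\":\"amount\",\"type\":\"double\"}]}");
>
> // Reader schema: the subset declared in the Flink SQL CREATE TABLE.
> Schema readerSchema = new Schema.Parser().parse(
>     "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
>   + "{\"name\":\"order_id\",\"type\":\"long\"},"
>   + "{\"name\":\"amount\",\"type\":\"double\"}]}");
>
> // With both schemas, Avro resolves them and skips 'payload' during
> // decoding instead of materializing it and discarding it afterwards.
> GenericDatumReader<GenericRecord> datumReader =
>     new GenericDatumReader<>(writerSchema, readerSchema);
> Decoder decoder = DecoderFactory.get().binaryDecoder(serializedBytes, null);
> GenericRecord projected = datumReader.read(null, decoder);
> ```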
--
This message was sent by Atlassian Jira
(v8.20.10#820010)