[ 
https://issues.apache.org/jira/browse/FLINK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaolong3817 updated FLINK-39035:
---------------------------------
    Description: 
*Description:*

The Avro format offers several valuable features, including *fast-read* and 
{*}column pruning{*}. However, the current {{AvroDeserializationSchema}} does 
not leverage either capability. While {{RegistryAvroDeserializationSchema}} 
exists, it depends on a schema registry, so its adoption cost and complexity 
are high for many use cases.

I propose supporting these features in the standard 
{{AvroDeserializationSchema}} by introducing a few simple configuration 
parameters.

*Proposed Changes:*

*1. Support Fast Read ({{{}open.fastread{}}} / {{{}avro.fast-read.enabled{}}})*
The fast-read capability can be enabled via a simple configuration option 
(defaulting to {{{}false{}}}). When the feature is enabled, Avro builds and 
caches a set of specialized record readers internally; we only need to invoke 
{{setSchema}} on the datum reader for the specific class to activate this 
optimization.
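As a sketch, the option could be exposed to SQL users roughly like this (the option name {{avro.fast-read.enabled}} is the one proposed above; the connector, table, and field names are illustrative only):

```sql
CREATE TABLE orders (
  id BIGINT,
  payload STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'format' = 'avro',
  -- proposed option; defaults to 'false'
  'avro.fast-read.enabled' = 'true'
);
```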

*2. Support Writer Schema for Column Pruning ({{{}avro.writer.schemaString{}}})*
Avro inherently supports column pruning (projection). This is particularly 
useful when the upstream table contains a large number of columns, but the 
Flink SQL table only requires a subset of them.

Currently, when SQL users use the {{CREATE TABLE}} statement, the defined 
schema corresponds to the Avro {*}Reader Schema{*}, and the writer schema is 
implicitly assumed to be identical to it.
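To illustrate the Avro mechanism this proposal relies on (a minimal sketch assuming plain Avro on the classpath; the record and field names are made up): passing distinct writer and reader schemas to a {{GenericDatumReader}} triggers schema resolution, so fields absent from the reader schema are skipped during decoding rather than materialized.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroProjectionDemo {

    // Decodes a record written with the wide writer schema using a narrow
    // reader schema, so Avro skips the unneeded 'payload' field entirely.
    static GenericRecord run() throws Exception {
        // Writer schema: the full upstream record (many columns in practice).
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"payload\",\"type\":\"string\"}]}");
        // Reader schema: only the subset the Flink SQL table declares.
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"}]}");

        GenericRecord rec = new GenericData.Record(writer);
        rec.put("id", 42L);
        rec.put("payload", "large blob the query never reads");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
        enc.flush();

        // Writer schema + reader schema => schema resolution with projection.
        GenericDatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(writer, reader);
        return datumReader.read(null,
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // projected record contains only 'id'
    }
}
```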

We should support a parameter to explicitly specify the {*}Writer Schema{*}. 
Additionally, we can provide an API for upstream connectors (e.g., Kafka) so 
that, through projection pushdown, the connector can pass the actual writer 
schema to the deserializer. This allows the Avro format to efficiently skip 
unneeded fields during decoding, yielding significant performance 
improvements.
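A corresponding DDL sketch (the option name {{avro.writer.schemaString}} is the one proposed above; the inline schema string and table layout are illustrative, and in practice the string would be the full upstream schema):

```sql
CREATE TABLE orders (
  id BIGINT  -- the table only declares the columns it needs
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'format' = 'avro',
  -- proposed option: the schema the producer actually wrote with;
  -- fields absent from the table schema are skipped during decoding
  'avro.writer.schemaString' =
    '{"type":"record","name":"R","fields":[{"name":"id","type":"long"},{"name":"payload","type":"string"}]}'
);
```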

  was:
*Description:*

The Avro format offers several valuable features, including *fast-read* and 
{*}column pruning{*}. However, the current {{AvroDeserializationSchema}} does 
not leverage these capabilities. While {{RegistryAvroDeserializationSchema}} 
exists, its adoption cost and complexity are exceptionally high for many use 
cases.

I propose supporting these features in the standard 
{{AvroDeserializationSchema}} by introducing a few simple configuration 
parameters.

*Proposed Changes:*

*1. Support Fast Read ({{{}open.fastread{}}} / {{{}avro.fast-read.enabled{}}})* 
The fast-read capability can be enabled via a simple configuration parameter 
(defaulting to {{{}false{}}}). When users enable this feature, Avro internally 
caches a series of parsers. We only need to invoke {{setSchema}} for the 
specific class to activate this optimization.

*2. Support Writer Schema for Column Pruning 
({{{}avro.writer.schema.string{}}})* Avro inherently supports column pruning 
(projection). This is particularly useful when the upstream table contains a 
large number of columns, but the Flink SQL table only requires a subset of them.

Currently, when SQL users use the {{CREATE TABLE}} statement, the defined 
schema corresponds to the Avro {*}Reader Schema{*}.

We should support a parameter to explicitly specify the {*}Writer Schema{*}. 
Additionally, we can provide an API for upstream connectors (e.g., Kafka). 
Through projection pushdown, the connector can pass the actual schema to the 
deserializer. This allows the Avro format to efficiently skip reading 
unnecessary data fields, resulting in significant performance improvements.


> Support Avro fast-read and column pruning in AvroDeserializationSchema via 
> configuration
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-39035
>                 URL: https://issues.apache.org/jira/browse/FLINK-39035
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 2.3.0
>            Reporter: xiaolong3817
>            Priority: Major
>
> *Description:*
> The Avro format offers several valuable features, including *fast-read* and 
> {*}column pruning{*}. However, the current {{AvroDeserializationSchema}} does 
> not leverage these capabilities. While {{RegistryAvroDeserializationSchema}} 
> exists, its adoption cost and complexity are exceptionally high for many use 
> cases.
> I propose supporting these features in the standard 
> {{AvroDeserializationSchema}} by introducing a few simple configuration 
> parameters.
> *Proposed Changes:*
> *1. Support Fast Read ({{{}open.fastread{}}} / 
> {{{}avro.fast-read.enabled{}}})* The fast-read capability can be enabled via 
> a simple configuration parameter (defaulting to {{{}false{}}}). When users 
> enable this feature, Avro internally caches a series of parsers. We only need 
> to invoke {{setSchema}} for the specific class to activate this optimization.
> *2. Support Writer Schema for Column Pruning 
> ({{{}avro.writer.schemaString{}}})* Avro inherently supports column pruning 
> (projection). This is particularly useful when the upstream table contains a 
> large number of columns, but the Flink SQL table only requires a subset of 
> them.
> Currently, when SQL users use the {{CREATE TABLE}} statement, the defined 
> schema corresponds to the Avro {*}Reader Schema{*}.
> We should support a parameter to explicitly specify the {*}Writer Schema{*}. 
> Additionally, we can provide an API for upstream connectors (e.g., Kafka). 
> Through projection pushdown, the connector can pass the actual schema to the 
> deserializer. This allows the Avro format to efficiently skip reading 
> unnecessary data fields, resulting in significant performance improvements.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
