[ https://issues.apache.org/jira/browse/FLINK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xiaolong3817 updated FLINK-39035:
---------------------------------
Description:
*Description:*
The Avro format offers several valuable features, including *fast-read* and
*column pruning*. However, the current {{AvroDeserializationSchema}} does not
leverage either capability. {{RegistryAvroDeserializationSchema}} exists, but it
depends on a schema registry, so its adoption cost and complexity are
prohibitively high for many use cases.
I propose supporting these features in the standard
{{AvroDeserializationSchema}} by introducing a few simple configuration
parameters.
*Proposed Changes:*
*1. Support Fast Read ({{open.fastread}} / {{avro.fast-read.enabled}})*
The fast-read capability can be enabled via a simple configuration parameter
(defaulting to {{false}}). When the feature is enabled, Avro internally caches
a series of generated readers; we only need to invoke {{setSchema}} for the
specific record class to activate the optimization.
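For illustration, the wiring inside the deserialization schema could look like the sketch below. Mapping the proposed flag onto Avro's {{GenericData#setFastReaderEnabled}} (available since Avro 1.11) is an assumption of this sketch, not the final design, and the helper class is hypothetical.
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper: sketches how "avro.fast-read.enabled" could be applied
// when the deserialization schema creates its datum reader.
public class FastReadSetup {

    public static GenericDatumReader<GenericRecord> createReader(
            Schema schema, boolean fastReadEnabled) {
        GenericData data = GenericData.get();
        // Assumption: the flag maps onto Avro's fast-reader switch, which makes
        // Avro generate and cache per-schema readers instead of re-interpreting
        // the schema for every record.
        data.setFastReaderEnabled(fastReadEnabled);
        GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(null, null, data);
        // setSchema keys the cached fast reader to this specific record class.
        reader.setSchema(schema);
        return reader;
    }
}
{code}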
*2. Support Writer Schema for Column Pruning ({{avro.writer.schemaString}})*
Avro natively supports column pruning (projection). This is particularly useful
when the upstream table contains a large number of columns but the Flink SQL
table only requires a subset of them.
Currently, when SQL users issue a {{CREATE TABLE}} statement, the defined
schema corresponds to the Avro *Reader Schema*. We should support a parameter
to explicitly specify the *Writer Schema*.
Additionally, we can provide an API for upstream connectors (e.g., Kafka):
through projection pushdown, the connector can pass the projected schema to the
deserializer. This allows the Avro format to skip reading unneeded fields
entirely, resulting in significant performance improvements.
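For context, the pruning itself is plain Avro schema resolution: constructing the datum reader with both the writer schema and a projected reader schema makes the resolving decoder skip the bytes of the omitted fields. A minimal, self-contained sketch (the {{Event}} schema is invented for illustration):
{code:java}
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class ProjectionExample {

    public static GenericRecord readProjected(byte[] payload) throws IOException {
        // Writer schema: the full schema the producer wrote with.
        Schema writerSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                        + "{\"name\":\"id\",\"type\":\"long\"},"
                        + "{\"name\":\"payload\",\"type\":\"string\"},"
                        + "{\"name\":\"ts\",\"type\":\"long\"}]}");
        // Reader schema: the projected subset the Flink table actually needs.
        Schema readerSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                        + "{\"name\":\"id\",\"type\":\"long\"},"
                        + "{\"name\":\"ts\",\"type\":\"long\"}]}");
        // With both schemas, Avro's resolving decoder skips the bytes of the
        // "payload" field instead of materializing it.
        GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(writerSchema, readerSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return reader.read(null, decoder);
    }
}
{code}
Under the proposal, {{avro.writer.schemaString}} would supply the first schema, while the {{CREATE TABLE}} definition (or the schema passed down via projection pushdown) supplies the second.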
> Support Avro fast-read and column pruning in AvroDeserializationSchema via
> configuration
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-39035
> URL: https://issues.apache.org/jira/browse/FLINK-39035
> Project: Flink
> Issue Type: Improvement
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Affects Versions: 2.3.0
> Reporter: xiaolong3817
> Priority: Major