+1

Hyukjin Kwon <[email protected]> wrote on Tue, Jan 13, 2026 at 15:29:

> I am fine with adding this support.
>
> On Wed, 1 Oct 2025 at 02:22, Nimrod Ofek <[email protected]> wrote:
>
>> Hello Spark Developers,
>>
>> I'd like to draw attention to a significant limitation in the current
>> *open-source
>> from_avro implementation* within Apache Spark SQL, especially regarding
>> its integration with the common *Kafka + Avro* ecosystem.
>>
>> The current design, which is largely "naive" in that it requires a
>> manually supplied, static schema, falls short of supporting the most basic
>> and prevalent streaming scenario: reading an *Avro-encoded Kafka topic
>> with schema evolution*.
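>>
>> For context, a minimal sketch of today's static-schema usage (the
>> DataFrame df and the schema JSON here are illustrative):
>>
>>     import org.apache.spark.sql.functions.col
>>     import org.apache.spark.sql.avro.functions.from_avro
>>
>>     // Today's API: the schema is supplied statically as a JSON string;
>>     // records written under any other schema version are not resolved
>>     // against it.
>>     val schemaJson =
>>       """{"type":"record","name":"Event","fields":[
>>         |  {"name":"id","type":"long"},
>>         |  {"name":"name","type":"string","default":""}
>>         |]}""".stripMargin
>>
>>     val parsed = df.select(from_avro(col("value"), schemaJson).as("event"))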
>>
>> The Core Problem: Missing "Automatic" Schema Resolution
>>
>> When an Avro record is paired with a *Schema Registry* (like Confluent),
>> the standard procedure is:
>>
>>    1. The record bytes contain a *Schema ID* header.
>>
>>    2. The consumer (Spark) uses this ID to fetch the corresponding *writer
>>    schema* from the registry.
>>
>>    3. The consumer also uses its desired *reader schema* (often the latest
>>    version).
>>
>>    4. The Avro library's core function performs *schema resolution* using
>>    both the writer and reader schemas. This is what handles *schema
>>    evolution* by automatically dropping old fields or applying default
>>    values for new fields.
>>
>> *Crucially, this entire process is currently missing from the open-source
>> Spark core.*
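>>
>> To make the missing piece concrete, here is a minimal sketch of steps 1-4
>> with the plain Avro Java library and Confluent's wire format (one magic
>> byte, then a 4-byte big-endian schema ID, then the Avro payload); the
>> fetchWriterSchema helper and readerSchemaJson value are stand-ins:
>>
>>     import java.nio.ByteBuffer
>>     import org.apache.avro.Schema
>>     import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
>>     import org.apache.avro.io.DecoderFactory
>>
>>     // Reader schema: the (usually latest) schema this consumer wants
>>     // records in; readerSchemaJson is its JSON definition.
>>     val readerSchema: Schema = new Schema.Parser().parse(readerSchemaJson)
>>
>>     // Stand-in for a Schema Registry client call; a real implementation
>>     // would query the registry's REST API and cache schemas by ID.
>>     def fetchWriterSchema(schemaId: Int): Schema = ???
>>
>>     def decode(payload: Array[Byte]): GenericRecord = {
>>       val buf = ByteBuffer.wrap(payload)
>>       require(buf.get() == 0, "unexpected magic byte") // wire-format marker
>>       val schemaId = buf.getInt()                      // 4-byte schema ID
>>       val writerSchema = fetchWriterSchema(schemaId)
>>       // GenericDatumReader(writer, reader) is the resolution step: removed
>>       // fields are skipped and new fields take their declared defaults.
>>       val reader =
>>         new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
>>       val decoder = DecoderFactory.get()
>>         .binaryDecoder(payload, 5, payload.length - 5, null)
>>       reader.read(null, decoder)
>>     }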
>>
>> Why This Is a Critical Gap:
>>
>>
>>    - It forces users to rely on non-standard, and sometimes poorly
>>    maintained, third-party libraries (like the now-partially-stalled ABRiS
>>    project) or proprietary vendor extensions (like those available in
>>    Databricks, where support is likewise only partial).
>>
>>    - The absence of this feature makes the out-of-the-box Kafka-to-Spark
>>    data pipeline for Avro highly brittle, non-compliant with standard
>>    Avro/Schema Registry practices, and cumbersome to maintain when schemas
>>    inevitably change.
>>
>> Proposed Path Forward
>>
>> Given that this is an essential and ubiquitous pattern for using Spark
>> with Kafka, I strongly believe that *native Schema Registry integration
>> and automatic schema resolution must become a core feature of Apache
>> Spark*.
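>>
>> Purely as an illustration of the resulting developer experience (the
>> option names below are hypothetical, not a concrete design proposal),
>> native support could let from_avro take registry coordinates instead of
>> a static schema string:
>>
>>     // Hypothetical sketch only: today's from_avro requires a JSON schema
>>     // string; this imagines an overload driven by registry options.
>>     val parsed = kafkaDf.select(
>>       from_avro(
>>         col("value"),
>>         Map(
>>           "schema.registry.url" -> "http://registry:8081", // hypothetical
>>           "reader.schema.strategy" -> "latest"             // hypothetical
>>         )
>>       ).as("event"))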
>>
>> This enhancement would not only bring Spark up to parity with standard
>> data engineering expectations but also significantly lower the barrier to
>> entry for building robust, schema-compliant streaming pipelines.
>>
>> I encourage the community to consider dedicating resources to integrating
>> this fundamental Avro deserialization logic into the core from_avro
>> function; I'll be happy to take part in that effort.
>>
>> Thank you for considering this proposal to make Spark an even more
>> powerful and streamlined tool for streaming data.
>>
>> Nimrod
>>
>
