I am fine with adding this support.

On Wed, 1 Oct 2025 at 02:22, Nimrod Ofek <[email protected]> wrote:

> Hello Spark Developers,
>
> I'd like to bring attention to a significant limitation in the current 
> *open-source
> from_avro implementation* within Apache Spark SQL, especially regarding
> its integration with the common *Kafka + Avro* ecosystem.
>
> The current design is naive in the sense that it requires a manually
> supplied, static schema, and it therefore falls short of the most basic and
> prevalent streaming scenario: reading an *Avro-encoded Kafka topic with
> schema evolution*.
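>
> For reference, a minimal sketch of what today's static-schema usage looks
> like (the broker address, topic name, and schema below are placeholders
> chosen purely for illustration):
>
>     import org.apache.spark.sql.SparkSession
>     import org.apache.spark.sql.avro.functions.from_avro
>     import org.apache.spark.sql.functions.col
>
>     val spark = SparkSession.builder().appName("StaticAvroExample").getOrCreate()
>
>     // The reader schema has to be supplied by hand and kept in sync manually.
>     val jsonFormatSchema =
>       """{"type":"record","name":"User","fields":
>          [{"name":"id","type":"long"},{"name":"name","type":"string"}]}"""
>
>     val kafkaDf = spark.readStream
>       .format("kafka")
>       .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
>       .option("subscribe", "users")                        // placeholder topic
>       .load()
>
>     // Generally decodes correctly only if every record was written with exactly
>     // this schema and without the Confluent wire-format header
>     // (magic byte + 4-byte schema id).
>     val parsed = kafkaDf.select(from_avro(col("value"), jsonFormatSchema).as("user"))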
>
> The Core Problem: Missing "Automatic" Schema Resolution
>
> When an Avro record is paired with a *Schema Registry* (like Confluent),
> the standard procedure is:
>
>    1. The record bytes contain a *Schema ID* header.
>    2. The consumer (Spark) uses this ID to fetch the corresponding *writer
>       schema* from the registry.
>    3. The consumer also uses its desired *reader schema* (often the latest
>       version).
>    4. The Avro library's core function performs *schema resolution* using
>       both the writer and reader schemas. This is what handles *schema
>       evolution* by automatically dropping old fields or applying default
>       values for new fields.
>
> *Crucially, this entire process is currently missing from the open-source
> Spark core.*
>
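> To make the gap concrete, here is a rough sketch of the resolution logic
> users currently have to hand-roll outside Spark. It uses plain Avro classes;
> the fetchWriterSchema helper is purely illustrative and would normally be
> backed by a Schema Registry client.
>
>     import java.nio.ByteBuffer
>     import org.apache.avro.Schema
>     import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
>     import org.apache.avro.io.DecoderFactory
>
>     // Illustrative placeholder: in practice this would call a Schema Registry
>     // client to look up the writer schema by its id.
>     def fetchWriterSchema(schemaId: Int): Schema = ???
>
>     // readerSchema is the schema the consumer wants (often the latest version).
>     def decode(recordBytes: Array[Byte], readerSchema: Schema): GenericRecord = {
>       val buffer = ByteBuffer.wrap(recordBytes)
>       require(buffer.get() == 0, "unexpected magic byte") // Confluent wire format: 1 magic byte
>       val schemaId = buffer.getInt()                      // followed by a 4-byte schema id
>       val writerSchema = fetchWriterSchema(schemaId)
>
>       // Avro resolves the writer schema against the reader schema here:
>       // removed fields are skipped, new fields get their declared defaults.
>       val reader  = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
>       val decoder = DecoderFactory.get().binaryDecoder(recordBytes, 5, recordBytes.length - 5, null)
>       reader.read(null, decoder)
>     }
>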
> Why This Is a Critical Gap:
>
>    - It forces users to rely on non-standard, and sometimes poorly
>      maintained, third-party libraries (like the now partially stalled ABRiS
>      project) or on proprietary vendor extensions (like those available in
>      Databricks, where support is also only partial).
>    - The absence of this feature makes the out-of-the-box Kafka-to-Spark
>      data pipeline for Avro highly brittle, non-compliant with standard
>      Avro/Schema Registry practices, and cumbersome to maintain when schemas
>      inevitably change.
> Proposed Path Forward
>
> Given that this is an essential and ubiquitous pattern for using Spark with
> Kafka, I strongly believe that *native Schema Registry integration and
> automatic schema resolution must become a core feature of Apache Spark*.
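>
> Purely as a strawman to make the proposal concrete (no such function or
> option names exist in Spark today; everything below is invented solely for
> illustration), the user-facing surface could look roughly like this:
>
>     import org.apache.spark.sql.{Column, DataFrame}
>     import org.apache.spark.sql.functions.col
>
>     // Strawman only: this function does NOT exist in Spark; the signature and
>     // option names are placeholders meant to anchor the discussion.
>     def from_avro_with_registry(data: Column, options: Map[String, String]): Column = ???
>
>     def parseUsers(kafkaDf: DataFrame): DataFrame =
>       kafkaDf.select(
>         from_avro_with_registry(
>           col("value"),
>           Map(
>             "schema.registry.url" -> "http://localhost:8081", // placeholder registry URL
>             "subject"             -> "users-value"            // placeholder subject name
>           )
>         ).as("user")
>       )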
>
> This enhancement would not only bring Spark up to parity with standard
> data engineering expectations but also significantly lower the barrier to
> entry for building robust, schema-compliant streaming pipelines.
>
> I encourage the community to consider dedicating resources to integrating
> this fundamental Avro deserialization logic into the core from_avro
> function - I'd be happy to take part in that effort.
>
> Thank you for considering this proposal to make Spark an even more
> powerful and streamlined tool for streaming data.
>
> Nimrod
>
