I am fine with adding this support. A rough sketch of what the decode path
could look like is at the end of this message.

On Wed, 1 Oct 2025 at 02:22, Nimrod Ofek <[email protected]> wrote:
> Hello Spark Developers,
>
> I'd like to draw attention to a significant limitation in the current
> *open-source from_avro implementation* within Apache Spark SQL, especially
> regarding its integration with the common *Kafka + Avro* ecosystem.
>
> The current design, which is largely "naive" in that it requires a
> manually supplied, static schema, falls short of supporting the most basic
> and prevalent streaming scenario: reading an *Avro-encoded Kafka topic
> with schema evolution*.
>
> The Core Problem: Missing "Automatic" Schema Resolution
>
> When an Avro record is paired with a *Schema Registry* (like Confluent),
> the standard procedure is:
>
> 1. The record bytes carry a *Schema ID* header.
> 2. The consumer (Spark) uses this ID to fetch the corresponding *writer
>    schema* from the registry.
> 3. The consumer also supplies its desired *reader schema* (often the
>    latest version).
> 4. The Avro library performs *schema resolution* using both the writer
>    and reader schemas. This is what handles *schema evolution* by
>    automatically dropping removed fields and applying default values for
>    new fields.
>
> *Crucially, this entire process is currently missing from the open-source
> Spark core.*
>
> Why This Is a Critical Gap:
>
> - It forces users to rely on non-standard, and sometimes poorly
>   maintained, third-party libraries (like the now-partially-stalled ABRiS
>   project) or on proprietary vendor extensions (like those in Databricks,
>   where support is also only partial).
> - The absence of this feature makes the out-of-the-box Kafka-to-Spark
>   pipeline for Avro highly brittle, non-compliant with standard
>   Avro/Schema Registry practices, and cumbersome to maintain when schemas
>   inevitably change.
>
> Proposed Path Forward
>
> Given that this is an essential and ubiquitous pattern for using Spark
> with Kafka, I strongly believe that *native Schema Registry integration
> and automatic schema resolution must become a core feature of Apache
> Spark*.
>
> This enhancement would not only bring Spark up to parity with standard
> data engineering expectations but also significantly lower the barrier to
> entry for building robust, schema-compliant streaming pipelines.
>
> I encourage the community to consider dedicating resources to integrating
> this fundamental Avro deserialization logic into the core from_avro
> function - I'll be happy to take part in the work.
>
> Thank you for considering this proposal to make Spark an even more
> powerful and streamlined tool for streaming data.
>
> Nimrod
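
For context, the static-schema API in question is from_avro in the
spark-avro module, which takes the full schema as a JSON string up front.
A minimal sketch of today's usage (topic, column, and schema names are
illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.avro.functions.from_avro

    val spark = SparkSession.builder().appName("avro-static").getOrCreate()
    import spark.implicits._

    // The schema is supplied manually and is fixed for the lifetime of
    // the query - there is no lookup against a schema registry.
    val jsonFormatSchema =
      """{"type":"record","name":"User","fields":[
        |{"name":"name","type":"string"},
        |{"name":"age","type":"int","default":0}]}""".stripMargin

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "users")
      .load()

    // Breaks once producers register a new schema version, and it does not
    // strip the 5-byte Confluent wire-format header in front of the payload.
    val parsed = df.select(from_avro($"value", jsonFormatSchema).as("user"))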
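
And for the resolution flow in steps 1-4 above, a minimal sketch assuming
the Confluent wire format (magic byte 0x0, then a 4-byte big-endian schema
id, then the Avro body) and a recent io.confluent:kafka-schema-registry-client,
whose getSchemaById returns a ParsedSchema. Names here are illustrative,
not a proposed API:

    import java.nio.ByteBuffer

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.DecoderFactory
    import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient

    def decodeWithResolution(
        payload: Array[Byte],
        registry: SchemaRegistryClient,   // e.g. a CachedSchemaRegistryClient
        readerSchema: Schema): GenericRecord = {
      val buf = ByteBuffer.wrap(payload)
      require(buf.get() == 0, "unknown magic byte")  // step 1: header check,
      val schemaId = buf.getInt()                    //         4-byte schema id

      // Step 2: fetch the writer schema that actually produced these bytes.
      val writerSchema = new Schema.Parser()
        .parse(registry.getSchemaById(schemaId).canonicalString())

      // Steps 3 + 4: Avro resolves the writer schema against the reader
      // schema - removed fields are dropped, added fields get defaults.
      val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
      val decoder = DecoderFactory.get()
        .binaryDecoder(payload, buf.position(), buf.remaining(), null)
      reader.read(null, decoder)
    }

In a real integration the registry client and parsed schemas would need to
be cached per executor (the Confluent client already caches by id), but the
core of the feature is essentially the dozen lines above.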
