+1

Hyukjin Kwon <[email protected]> wrote on Tue, 13 Jan 2026 at 15:29:
> I am fine with adding this support.
>
> On Wed, 1 Oct 2025 at 02:22, Nimrod Ofek <[email protected]> wrote:
>
>> Hello Spark Developers,
>>
>> I'd like to bring attention to a significant limitation in the current
>> *open-source from_avro implementation* within Apache Spark SQL,
>> especially regarding its integration with the common *Kafka + Avro*
>> ecosystem.
>>
>> The current design, which is largely "naive" in that it requires a
>> manually supplied, static schema, falls short of supporting the most
>> basic and prevalent streaming scenario: reading an *Avro-encoded Kafka
>> topic with schema evolution*.
>>
>> The Core Problem: Missing "Automatic" Schema Resolution
>>
>> When an Avro record is paired with a *Schema Registry* (like Confluent),
>> the standard procedure is:
>>
>> 1. The record bytes contain a *Schema ID* header.
>> 2. The consumer (Spark) uses this ID to fetch the corresponding *writer
>>    schema* from the registry.
>> 3. The consumer also uses its desired *reader schema* (often the latest
>>    version).
>> 4. The Avro library's core function performs *schema resolution* using
>>    both the writer and reader schemas. This is what handles *schema
>>    evolution* by automatically dropping old fields or applying default
>>    values for new fields.
>>
>> *Crucially, this entire process is currently missing from the
>> open-source Spark core.*
>>
>> Why This Is a Critical Gap:
>>
>> - It forces users to rely on non-standard, and sometimes poorly
>>   maintained, third-party libraries (like the now-partially-stalled
>>   ABRiS project) or proprietary vendor extensions (like those available
>>   in Databricks, where support is also only partial).
>> - The absence of this feature makes the out-of-the-box Kafka-to-Spark
>>   data pipeline for Avro highly brittle, non-compliant with standard
>>   Avro/Schema Registry practices, and cumbersome to maintain when
>>   schemas inevitably change.
>>
>> Proposed Path Forward
>>
>> Given that this is an essential and ubiquitous pattern for using Spark
>> with Kafka, I strongly believe that *native Schema Registry integration
>> and automatic schema resolution must become a core feature of Apache
>> Spark*.
>>
>> This enhancement would not only bring Spark up to parity with standard
>> data engineering expectations but also significantly lower the barrier
>> to entry for building robust, schema-compliant streaming pipelines.
>>
>> I encourage the community to consider dedicating resources to
>> integrating this fundamental Avro deserialization logic into the core
>> from_avro function - I'll be happy to take part in it.
>>
>> Thank you for considering this proposal to make Spark an even more
>> powerful and streamlined tool for streaming data.
>>
>> Nimrod
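
For context on the "manually supplied, static schema" limitation Nimrod describes, this is roughly what the existing open-source API looks like today: from_avro takes the Avro schema as a JSON string at query-authoring time, so every consumer hard-codes one schema version. A minimal Scala sketch follows, assuming the external spark-avro module is on the classpath; the topic, broker, and record schema are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("static-from-avro").getOrCreate()

// The schema is supplied by hand as a JSON string and is fixed for the
// lifetime of the query; nothing is looked up per record from a registry.
val jsonFormatSchema =
  """{"type":"record","name":"User","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"}]}""".stripMargin

val users = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "users")                     // placeholder topic
  .load()
  // Breaks once producers evolve the schema, and it does not understand the
  // Confluent wire-format prefix, because no writer-schema lookup happens.
  .select(from_avro(col("value"), jsonFormatSchema).as("user"))
```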

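To make steps 1-4 of the quoted procedure concrete, here is a minimal sketch of the deserialization logic the proposal would fold into from_avro. It assumes the Confluent wire format (one zero magic byte followed by a big-endian 4-byte schema ID) and a hypothetical `fetchSchema` helper that retrieves and caches writer schemas from the registry's REST API; it illustrates Avro's own resolution behaviour, not a proposed Spark API.

```scala
import java.nio.ByteBuffer
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// payload:      raw Kafka record value in Confluent wire format
// readerSchema: the schema the job wants to see (often the latest version)
// fetchSchema:  hypothetical helper that looks up (and caches) a writer
//               schema by ID via the Schema Registry REST API
def decodeConfluentAvro(payload: Array[Byte],
                        readerSchema: Schema,
                        fetchSchema: Int => Schema): GenericRecord = {
  val buffer = ByteBuffer.wrap(payload)
  val magic = buffer.get()                     // step 1: 1 magic byte ...
  require(magic == 0, s"Unexpected magic byte: $magic")
  val schemaId = buffer.getInt()               // ... plus a 4-byte schema ID
  val writerSchema = fetchSchema(schemaId)     // step 2: fetch writer schema

  // Steps 3-4: Avro resolves writer vs. reader schema, applying defaults for
  // newly added fields and dropping removed ones (schema evolution).
  val datumReader =
    new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  val decoder = DecoderFactory.get()
    .binaryDecoder(payload, 5, payload.length - 5, null)
  datumReader.read(null, decoder)
}
```

Wrapping something like this in a UDF works, but it pushes registry HTTP calls, caching, and error handling into user code, which is exactly the burden native support in from_avro would remove.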