jecsand838 opened a new issue, #9233:
URL: https://github.com/apache/arrow-rs/issues/9233

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   `arrow-avro` currently contains Arrow -> Avro schema logic in `schema.rs` 
that was originally built as a *writer convenience* for when an Avro schema is 
not provided. As such, it was developed to make a best-effort attempt at 
synthesizing an `AvroSchema` from an Arrow `Schema` (plus some Arrow schema 
metadata like `avro.name` / `avro.namespace`). 
   
   As `arrow-avro` adoption grows (OCF files, SOE frames, Confluent/Apicurio 
framing), we increasingly need schema behavior that is:
   - **Explicit** about whether we are using a real Avro schema vs inferring 
one,
   - **Modular** so it’s more maintainable (today `schema.rs` is a large 
multi-purpose module),
   - **Correct-by-construction** so downstream consumers don’t need to patch up 
inferred schemas or reimplement Avro schema editing.
   
   Real-world pain points motivating this include:
   - Schema inference from Arrow metadata alone can produce incorrect Avro 
schemas for nested named types (e.g. #8928: confusion between nested record 
*type name* and *field name*).
   - Downstream consumers (e.g. DataFusion) want to apply column projection at 
the Avro schema level without reimplementing Avro-aware projection and metadata 
handling (see #8923).
   - Tests/integration code often treat `{ Arrow schema + 
avro.name/avro.namespace }` as sufficient, but this is not reliable for all 
schemas, and brittle inference can break reader schema workflows.
   - The Arrow -> Avro path needs clearer configuration points (null-union 
ordering, naming strategy for generated nested types, metadata passthrough 
policy, etc.), but these knobs are currently either implicit, crate-private, or 
spread across helpers.
   
   **Describe the solution you'd like**
   
   Refactor / enhance `schema.rs` so Arrow -> Avro schema behavior is 
**explicit, modular, and correct-by-construction**, with APIs that clearly 
distinguish between these three fundamental schema conversion functions:
   1. **"Using the real schema"** (preferred for readers): consume the Avro 
writer schema (OCF header / schema registry / user-provided) and optionally 
transform it into a reader schema as needed (projection, evolution).
   2. **"Inferring a schema (defaults)"** (writer convenience): synthesize Avro 
JSON from an Arrow `Schema` when no Avro schema JSON is provided / embedded.
   3. **"Building an explicitly correct `AvroSchema` from an Arrow `Schema` 
(configured builder for users)"**: add an `ArrowToAvroSchemaBuilder` for 
constructing an `AvroSchema` from an Arrow `Schema` with explicit configuration 
knobs.
   
   Below are additional details for the proposed solution:
   
   **A) Introduce `ArrowToAvroSchemaBuilder`**
   
   Add a public builder-style API along these lines:
   
   ```rust
   use arrow_schema::Schema as ArrowSchema;
   use arrow_avro::schema::AvroSchema;
   
   // minimal defaults (equivalent to today's best effort inference)
   let avro: AvroSchema = ArrowToAvroSchemaBuilder::new(&arrow_schema).build()?;
   
   // explicit configuration
   let avro: AvroSchema = ArrowToAvroSchemaBuilder::new(&arrow_schema)
       .with_root_name("User")
       .with_namespace("com.example")
       .with_doc("Schema inferred from Arrow")
       .with_nullability_order(Nullability::NullFirst)
       .with_strip_internal_arrow_metadata(true)
       .with_type_naming_strategy(TypeNamingStrategy::PathBased)
       .with_passthrough_metadata_policy(PassthroughMetadataPolicy::Default)
       .build()?;
   ```
   
   Initial builder "with_" knobs that would help correctness and downstream 
use-cases:
   
   * Root record identity:
   
     * `with_root_name(...)` (default: `AVRO_ROOT_RECORD_DEFAULT_NAME` or Arrow 
`avro.name`)
     * `with_namespace(...)` (default: Arrow `avro.namespace` if present)
     * `with_doc(...)` (default: Arrow `avro.doc` if present)
   * Nullability + unions:
   
     * `with_nullability_order(Nullability::NullFirst|NullSecond)` (default 
`NullFirst`, aligning with Avro union-default constraints)
   * Metadata behavior:
   
     * `with_strip_internal_arrow_metadata(bool)` (defaults to current behavior)
     * `with_passthrough_metadata_policy(...)` controlling how non-reserved 
Arrow metadata becomes Avro attributes (today there is logic for "passthrough 
metadata" that could become configurable)
   * Naming strategy for generated nested named types (records/enums/fixed):
   
     * `with_type_naming_strategy(...)` to guarantee deterministic and 
collision-free nested type names
     * (optional) `with_type_name_overrides(...)` for explicit mapping by Arrow 
field-path
   * Logical/extension type policy:
   
     * Define how Arrow logical/extension types map to Avro logical types, and 
what happens when unsupported (error vs fallback encoding)
   
   This builder should be positioned as the explicit advanced inference entry 
point, while keeping a simpler defaults path for writer convenience.
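
To make the nullability knob concrete, here is a minimal sketch of how a `NullFirst` / `NullSecond` choice could shape the union emitted for an optional field. The `Nullability` enum here is a local stand-in for the proposed type, and `optional_union` / `allows_null_default` are hypothetical helpers for illustration only, not existing `arrow-avro` APIs. The rule encoded in `allows_null_default` follows the Avro 1.11.1 spec text cited in this issue (a union default must match the first branch).

```rust
/// Where the `"null"` branch sits in a two-branch optional union.
/// Local stand-in for the proposed `Nullability` knob (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Nullability {
    NullFirst,
    NullSecond,
}

/// Render the Avro JSON union for an optional field.
/// `inner_json` must already be valid Avro JSON, e.g. `"\"string\""`.
/// Hypothetical helper, not part of `arrow-avro`.
pub fn optional_union(inner_json: &str, order: Nullability) -> String {
    match order {
        Nullability::NullFirst => format!("[\"null\",{}]", inner_json),
        Nullability::NullSecond => format!("[{},\"null\"]", inner_json),
    }
}

/// A `null` field default is only valid when `"null"` is the first branch,
/// because the default must match the first branch of the union.
pub fn allows_null_default(order: Nullability) -> bool {
    order == Nullability::NullFirst
}
```

With `NullFirst`, an optional string renders as `["null","string"]` and may carry `"default": null`; with `NullSecond` the same field renders as `["string","null"]` and a `null` default would be rejected under that rule, which is the motivation for defaulting to `NullFirst`.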
   
   **B) Make "use embedded schema" vs "infer schema" explicit**
   
   Today, the Arrow schema metadata key `SCHEMA_METADATA_KEY = "avro.schema"` 
can contain the full Avro schema JSON. When present, it is often preferable to 
use it verbatim to preserve exact schema identity across OCF/SOE/registry 
contexts.
   
   We should make this explicit and stable:
   
   * A clear helper for "use embedded Avro schema if present, else error" 
(reader-like behavior)
   * A clear helper for "use embedded schema if present, else infer" (writer 
convenience)
   
   (Exact API design TBD, but could be builder flags or separate helpers.)
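
As one possible shape for the "separate helpers" option, the sketch below models Arrow schema metadata as a plain `HashMap<String, String>` and shows the two behaviors side by side. The function names, the `SchemaSource` enum, and the use of `String` for schemas and errors are all assumptions made for illustration; only the `avro.schema` key comes from this issue.

```rust
use std::collections::HashMap;

/// Metadata key named in this issue.
pub const SCHEMA_METADATA_KEY: &str = "avro.schema";

/// Records where the resulting Avro schema JSON came from (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SchemaSource {
    Embedded(String),
    Inferred(String),
}

/// Reader-like behavior: use the embedded Avro schema JSON, or fail.
pub fn embedded_schema_or_error(
    metadata: &HashMap<String, String>,
) -> Result<String, String> {
    metadata
        .get(SCHEMA_METADATA_KEY)
        .cloned()
        .ok_or_else(|| format!("missing `{}` in Arrow schema metadata", SCHEMA_METADATA_KEY))
}

/// Writer convenience: use the embedded schema verbatim if present,
/// otherwise call `infer` (a stand-in for Arrow -> Avro inference).
pub fn embedded_schema_or_infer<F>(
    metadata: &HashMap<String, String>,
    infer: F,
) -> SchemaSource
where
    F: FnOnce() -> String,
{
    match metadata.get(SCHEMA_METADATA_KEY) {
        Some(json) => SchemaSource::Embedded(json.clone()),
        None => SchemaSource::Inferred(infer()),
    }
}
```

Returning a tagged `SchemaSource` rather than a bare schema is one way to keep the "real vs inferred" distinction visible to callers; a builder flag could expose the same information.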
   
   **C) Split `schema.rs` by responsibility (internal refactor)**
   
   `schema.rs` currently mixes multiple concerns. Refactor into a module layout 
that preserves the public API but improves maintainability and testability, for 
instance:
   
   * `schema::mod`: schema representation + serde + builder (Avro JSON)
   * `schema::store`: schema store, canonical form + Rabin/MD5/SHA256 
fingerprints
   * `schema::metadata`: Arrow schema metadata keys + embed/extract helpers 
(`avro.schema`, `avro.name`, `avro.namespace`, `avro.doc`, defaults/enums)
   * `schema::infer`: Arrow -> Avro inference logic (used by the builder)
   * `schema::project`: Avro-aware projection/pruning utilities (ties into 
#8923)
   * `schema::evolve`: Avro-aware evolution/extension utilities (also used by 
the builder)
   * (optional) `schema::compat` / `schema::resolve`: compatibility checks + 
clearer error reporting (path + failure reason)
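
For context on what `schema::store` would own, below is a small sketch of the 64-bit Rabin fingerprint (CRC-64-AVRO) as described in the Avro specification. It is a direct transcription of the spec's reference algorithm, not the existing `arrow-avro` implementation, and it skips the Parsing Canonical Form step: callers are assumed to pass bytes that are already canonicalized. The function names are illustrative.

```rust
/// Fingerprint of the empty byte string, as defined for CRC-64-AVRO.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Build the 256-entry lookup table used by the byte-at-a-time loop.
fn fingerprint_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // Shift right by one bit; XOR in EMPTY when the shifted-out bit was 1.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *slot = fp;
    }
    table
}

/// Compute the 64-bit Rabin fingerprint of `bytes`.
/// The table is rebuilt on every call to keep the sketch short;
/// a real implementation would cache it.
pub fn rabin_fingerprint64(bytes: &[u8]) -> u64 {
    let table = fingerprint_table();
    let mut fp = EMPTY;
    for &b in bytes {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}
```

Keeping this logic in its own module would let the canonical-form and fingerprint code be tested independently of inference and projection.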
   
   **D) Provide Avro-aware schema projection/evolution primitives**
   
   Centralize Avro schema pruning/projection in `arrow-avro` (rather than 
downstream).
This is related to #8923 and would ideally live alongside the refactor so 
that the "use real schema", "inference", and "builder" paths can all share 
projection and evolution logic.
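
As a rough illustration of what a shared projection primitive might look like, the sketch below prunes top-level fields of a record while preserving the record name and the writer's field order. `AvroType`, `Field`, and `project_record` are simplified, hypothetical types invented for this example; they do not reflect the actual `arrow-avro` schema representation, and nested projection, aliases, and attributes are left out.

```rust
/// Drastically simplified Avro schema model (assumption, for illustration only).
#[derive(Clone, Debug, PartialEq)]
pub enum AvroType {
    Primitive(&'static str),
    Record { name: String, fields: Vec<Field> },
}

#[derive(Clone, Debug, PartialEq)]
pub struct Field {
    pub name: String,
    pub ty: AvroType,
}

/// Keep only the top-level fields named in `keep`.
/// The record name and the original field order are preserved, and an
/// unknown field name is reported as an error rather than silently ignored.
pub fn project_record(schema: &AvroType, keep: &[&str]) -> Result<AvroType, String> {
    match schema {
        AvroType::Record { name, fields } => {
            for wanted in keep {
                if !fields.iter().any(|f| f.name == *wanted) {
                    return Err(format!(
                        "field `{}` not found in record `{}`",
                        wanted, name
                    ));
                }
            }
            let kept = fields
                .iter()
                .filter(|f| keep.contains(&f.name.as_str()))
                .cloned()
                .collect();
            Ok(AvroType::Record { name: name.clone(), fields: kept })
        }
        other => Err(format!("cannot project non-record schema: {:?}", other)),
    }
}
```

Preserving the record name matters here because Avro schema resolution matches named types by name, so a projected reader schema should keep the writer's identity unless the caller explicitly changes it.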
   
   **E) Deprecate `AvroSchema::try_from()`**
   
   Deprecate the existing `AvroSchema::try_from()` method and use 
`ArrowToAvroSchemaBuilder::new(&arrow_schema).build()?` in its place. This 
shouldn't change any downstream behavior so long as `ArrowToAvroSchemaBuilder` 
matches `AvroSchema::try_from()` when no knobs are used.
   
   **Describe alternatives you've considered**
   
   1. **Continue fixing inference bugs incrementally without refactoring**
   
      * Risks continued complexity growth in `schema.rs` and makes it harder to 
reason about correctness across reader/writer/projection paths.
   
   2. **Require callers to always provide Avro schema JSON**
   
      * This removes the writer convenience path and doesn’t address 
projection/evolution needs or tests where schemas are partially specified via 
Arrow metadata.
   
   3. **Downstream projects implement Avro schema editing themselves**
   
      * This duplicates Avro-specific logic and encourages subtle divergences 
from `arrow-avro` behavior, especially around naming, metadata, and resolution.
   
   4. **Expose a single `InferOptions` struct instead of a builder**
   
      * Works initially, but becomes less ergonomic as options grow, and makes 
it harder to evolve without breaking call-sites. A builder provides a more 
extensible surface.
   
   **Additional context**
   
   * Related issues:
   
     * #8928 (nested named type: type name vs field name mismatch when 
generating schemas from Arrow-only metadata)
     * #8923 (need Avro-aware projection API in `ReaderBuilder` / centralize 
schema editing)
   * Relevant constants/metadata (current behavior to preserve where possible):
   
     * `SCHEMA_METADATA_KEY = "avro.schema"`
     * `AVRO_NAME_METADATA_KEY = "avro.name"`
     * `AVRO_NAMESPACE_METADATA_KEY = "avro.namespace"`
     * `AVRO_DOC_METADATA_KEY = "avro.doc"`
     * `AVRO_FIELD_DEFAULT_METADATA_KEY = "avro.field.default"`
     * `AVRO_ENUM_SYMBOLS_METADATA_KEY = "avro.enum.symbols"`
   * Avro spec considerations that influence inference defaults:
   
     * Union default values must match the first union branch, which is why 
`["null", T]` is typically preferred for optional fields:
       
[https://avro.apache.org/docs/1.11.1/specification/#unions](https://avro.apache.org/docs/1.11.1/specification/#unions)
   * This issue is intentionally large: the goal is to land a design that 
solves the schema limitations in `arrow-avro` in the long-run. This will need 
to be implemented either via sub-issues or smaller partial PRs.
   

