mzabaluev opened a new issue, #9575:
URL: https://github.com/apache/arrow-rs/issues/9575

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Applications that implement schema evolution at a layer above Avro, such as 
Iceberg, need to resolve the Arrow schema required for the output record 
batches against the writer schema of the Avro files.
   
   One example is datafusion-comet. There, Spark passes down a SQL schema for 
reading data from Avro files. That schema is converted to an Arrow schema for 
the "native" reader written in Rust, which streams `RecordBatch` items with 
that schema.
   
   **Describe the solution you'd like**
   
   A utility function that derives the Avro reader schema from these arguments:
   
   1. The Avro writer schema, which can be read from e.g. an OCF file using the 
`HeaderInfo` API added in #9548,
   2. The required Arrow schema for the output batches.
   
   The resulting Avro schema can be used to construct an (async) Avro reader 
using the `build_with_header` method.
   
   The schema resolution is recursive. Its behavior, including where it differs 
from the Avro resolution rules:
   
   * Struct/record fields are matched by name. Reordering of fields is allowed 
and should be reported as a modification of the writer schema.
   * Fields in the Arrow schema not present in the writer schema are added to 
the reader schema with the conventional nullable union type and the default 
value of `null`. It is an error if an added field is not nullable.
   * As Arrow struct types do not have names, the resulting record type in the 
reader schema receives the name and namespace attributes of the corresponding 
record in the writer schema.
   * The name of an Arrow list item field is not recorded in the corresponding 
Avro array schema. The record batch will have the "item" list field, as 
[currently hardcoded](https://github.com/apache/arrow-rs/blob/88422cbdcbfa8f4e2411d66578dd3582fafbf2a1/arrow-avro/src/reader/record.rs#L462). 
A metadata-based approach for round-tripping can be proposed separately.
   * Likewise, the names of the key and value fields in the struct that makes 
up an Arrow map are not recorded in Avro.
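
To make the record-level rules above more concrete, here is a minimal sketch using simplified stand-in types rather than the real `arrow-schema` or `arrow-avro` types. All names in it (`ArrowField`, `AvroField`, `AvroRecord`, `ResolveError`, `resolve_record`) are hypothetical and only illustrate the proposed behavior. It also assumes that writer fields absent from the Arrow schema are simply dropped from the reader schema, which the rules above do not state explicitly.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an Arrow field (hypothetical, not the arrow-schema type).
#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    nullable: bool,
}

/// Simplified stand-in for a field of an Avro record schema (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct AvroField {
    name: String,
    /// True when the field was synthesized as a nullable union with default `null`.
    added_with_null_default: bool,
}

/// Simplified stand-in for an Avro record schema (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct AvroRecord {
    name: String,
    namespace: Option<String>,
    fields: Vec<AvroField>,
}

#[derive(Debug, PartialEq)]
enum ResolveError {
    /// A field missing from the writer schema was requested as non-nullable.
    NonNullableAddedField(String),
}

/// Builds a reader record from the writer record and the required Arrow fields.
/// Returns the reader record and a flag telling whether it differs from the writer.
fn resolve_record(
    writer: &AvroRecord,
    arrow_fields: &[ArrowField],
) -> Result<(AvroRecord, bool), ResolveError> {
    // Fields are matched by name, not by position.
    let by_name: HashMap<&str, &AvroField> =
        writer.fields.iter().map(|f| (f.name.as_str(), f)).collect();

    let mut fields = Vec::with_capacity(arrow_fields.len());
    for af in arrow_fields {
        match by_name.get(af.name.as_str()) {
            // Present in the writer schema: reuse the writer's field definition.
            Some(wf) => fields.push((*wf).clone()),
            // Absent but nullable: add it with a `null` default.
            None if af.nullable => fields.push(AvroField {
                name: af.name.clone(),
                added_with_null_default: true,
            }),
            // Absent and non-nullable: no value can be produced, so reject.
            None => return Err(ResolveError::NonNullableAddedField(af.name.clone())),
        }
    }

    // Arrow structs are unnamed, so the record name and namespace come from the writer.
    let reader = AvroRecord {
        name: writer.name.clone(),
        namespace: writer.namespace.clone(),
        fields,
    };
    // Reordering, additions and dropped fields all count as modifications.
    let modified = reader != *writer;
    Ok((reader, modified))
}
```

In this sketch the output field order follows the Arrow schema, so a pure reordering yields a reader record that compares unequal to the writer record and is reported as modified.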
   
   The resolution function should provide a way to determine whether the 
resulting schema differs from the writer schema, so that an exact match can 
take a fast path that skips schema resolution.
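
One possible shape for reporting this is sketched below with a generic schema type `S`. The `Resolution` enum and the `classify` helper are hypothetical names used only for illustration and are not part of any existing arrow-avro API.

```rust
/// Outcome of resolving an Arrow schema against a writer schema (hypothetical).
#[derive(Debug, PartialEq)]
enum Resolution<S> {
    /// The resolved schema equals the writer schema; no reader schema is needed.
    Unchanged,
    /// The resolved schema differs and must be supplied as the reader schema.
    Modified(S),
}

impl<S> Resolution<S> {
    /// True when the caller can take the fast path without schema resolution.
    fn is_unchanged(&self) -> bool {
        matches!(self, Resolution::Unchanged)
    }

    /// The reader schema to configure, or `None` on an exact match.
    fn into_reader_schema(self) -> Option<S> {
        match self {
            Resolution::Unchanged => None,
            Resolution::Modified(schema) => Some(schema),
        }
    }
}

/// Wraps a resolved schema, comparing it against the writer schema.
fn classify<S: PartialEq>(writer: &S, resolved: S) -> Resolution<S> {
    if *writer == resolved {
        Resolution::Unchanged
    } else {
        Resolution::Modified(resolved)
    }
}
```

With a return type like this, a caller would set a reader schema on the reader builder only when `into_reader_schema` yields `Some`.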
   
   **Describe alternatives you've considered**
   
   The required Arrow schema could be passed to a builder method while 
constructing the reader, directly becoming the schema of the produced record 
batches. This would overlap with the existing reader schema option, and it is 
not clear how the two options would interact if used together.
   
   **Additional context**
   
   The wider topic of schema evolution and other adaptations, not specific to 
Avro, is discussed in #6735.

