rustyconover opened a new issue, #9444:
URL: https://github.com/apache/arrow-rs/issues/9444

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   The Arrow IPC format supports a `custom_metadata` field on the `Message` 
flatbuffer envelope 
([Message.fbs](https://github.com/apache/arrow/blob/main/format/Message.fbs#L154)),
 allowing per-batch metadata separate from schema-level metadata. Currently, 
the Rust `RecordBatch` struct has no `custom_metadata` field, and the IPC 
reader and writer ignore the field entirely.
   
   PyArrow has supported this since v11.0.0 via `write_batch(batch, 
custom_metadata=...)` and `read_next_batch_with_custom_metadata()`. As a 
result, per-batch metadata in IPC files written by PyArrow is silently dropped 
when those files are read by arrow-rs.
   
   **Describe the solution you'd like**
   
   1. Add a `custom_metadata: HashMap<String, String>` field to `RecordBatch` 
with accessor methods (`custom_metadata()`, `custom_metadata_mut()`, 
`with_custom_metadata()`, `into_parts_with_custom_metadata()`)
   2. IPC writer: serialize `custom_metadata` to the `Message` flatbuffer when 
writing record batches
   3. IPC reader: extract `custom_metadata` from the `Message` at all reader 
call sites (`FileDecoder`, `StreamReader`, `StreamDecoder`)
   4. arrow-flight: extract and propagate `custom_metadata` in 
`flight_data_to_arrow_batch`
   5. arrow-select: preserve `custom_metadata` through `filter_record_batch` 
and `take_record_batch`
   6. Preserve metadata through `slice()`, `project()`, `normalize()`, 
`with_schema()`, and `remove_column()`
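
   The accessor surface proposed in item 1 could look roughly like the sketch below. This is a hypothetical illustration using a simplified stand-in struct (`BatchSketch`, with all non-metadata fields elided), not the actual `RecordBatch` implementation:

   ```rust
   use std::collections::HashMap;

   // Simplified stand-in for `RecordBatch`; only the proposed field is shown.
   #[derive(Clone, Default)]
   struct BatchSketch {
       custom_metadata: HashMap<String, String>,
   }

   impl BatchSketch {
       /// Proposed: borrow the per-batch metadata.
       fn custom_metadata(&self) -> &HashMap<String, String> {
           &self.custom_metadata
       }

       /// Proposed: mutable access for in-place edits.
       fn custom_metadata_mut(&mut self) -> &mut HashMap<String, String> {
           &mut self.custom_metadata
       }

       /// Proposed: builder-style replacement, consuming `self`.
       fn with_custom_metadata(mut self, metadata: HashMap<String, String>) -> Self {
           self.custom_metadata = metadata;
           self
       }
   }

   fn main() {
       let mut meta = HashMap::new();
       meta.insert("source".to_string(), "sensor-7".to_string());

       // Builder-style attachment, then read-only access.
       let mut batch = BatchSketch::default().with_custom_metadata(meta);
       assert_eq!(
           batch.custom_metadata().get("source").map(String::as_str),
           Some("sensor-7")
       );

       // In-place mutation through the `_mut` accessor.
       batch
           .custom_metadata_mut()
           .insert("batch_id".to_string(), "42".to_string());
       println!("{}", batch.custom_metadata().len());
   }
   ```

   The builder-style `with_custom_metadata()` mirrors the existing `RecordBatch::with_schema()` convention, so the addition stays consistent with the rest of the API.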
   
   **Describe alternatives you've considered**
   
   - Storing per-batch metadata in schema-level metadata with a naming 
convention — this conflates two levels of metadata and doesn't match the IPC 
format's intent.
   - An `Option<HashMap<String, String>>` instead of `HashMap<String, String>` 
— `HashMap::new()` is zero-allocation so the overhead is minimal, and `Option` 
complicates every accessor for little gain.
   
   **Additional context**
   
   - `HashMap::new()` does not heap-allocate, so there is no performance 
concern for the default (empty metadata) case.
   - The existing `into_parts()` signature is unchanged for backward 
compatibility; a new `into_parts_with_custom_metadata()` is added.
   - Multi-batch merge operations (`concat_batches`, `interleave_record_batch`, 
`BatchCoalescer`) intentionally do not propagate per-batch metadata since the 
semantics are ambiguous when merging batches with different metadata.
   - Reuses existing `metadata_to_fb` (convert.rs) for writing and the KV 
extraction pattern for reading.
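
   The zero-allocation claim for the empty default can be verified directly: `HashMap::new()` defers allocation until the first insert, so carrying an empty `custom_metadata` map on every `RecordBatch` is free. A minimal check:

   ```rust
   use std::collections::HashMap;

   fn main() {
       // `HashMap::new()` allocates nothing up front; capacity is 0 until
       // the first insert, so the default (empty metadata) case is free.
       let empty: HashMap<String, String> = HashMap::new();
       println!("{}", empty.capacity());

       // Allocation happens lazily, only once metadata is actually attached.
       let mut populated: HashMap<String, String> = HashMap::new();
       populated.insert("key".to_string(), "value".to_string());
       println!("{}", populated.capacity() > 0);
   }
   ```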

