Hi All,
The Rust Arrow implementation stores metadata for child arrays in a
struct called Field [1]. This encodes the name, nullability, datatype,
and other metadata about that array. Currently the various schema
representations, such as DataType [2] and Schema [3], store uniquely
owned Field, as either Vec<Field> or Box<Field>. This poses a couple of
challenges:
1. A nested schema stores the same Field, including separate
allocations for its name and any metadata, redundantly at every level
of the hierarchy
2. The above nested schema will then be duplicated for every instance of
a nested array
3. Projecting or cloning schema results in large amounts of cloning of
Field names and metadata
4. Looking up a Field by name requires a linear, O(n), search through a
Vec<Field>
5. There is no cheap way to compare schemas, such as a pointer-equality check
Together these result in inefficient CPU and memory utilisation [4] [5],
especially for tables with wide or nested schemas.
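To make points 3 and 4 concrete, here is a minimal sketch. It uses a
simplified, hypothetical Field (only a name and a nullability flag), not
the real arrow_schema type, and the helper names shares_name_buffer and
find_by_name are made up for illustration.

```rust
// Simplified, hypothetical stand-in for arrow_schema::Field.
// The real type also carries a DataType and metadata.
#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

/// Returns true if both fields' names point at the same heap buffer.
fn shares_name_buffer(a: &Field, b: &Field) -> bool {
    a.name.as_ptr() == b.name.as_ptr()
}

/// Name lookup over uniquely owned fields: a linear scan.
fn find_by_name<'a>(fields: &'a [Field], name: &str) -> Option<&'a Field> {
    fields.iter().find(|f| f.name == name)
}

/// Cloning a Vec<Field> deep-copies every field, including its name.
fn clone_schema(fields: &[Field]) -> Vec<Field> {
    fields.to_vec()
}
```

With this layout every clone of the schema allocates a fresh String per
field, which is the cost the proposal aims to remove.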
The proposed fix [6] for this is:
- Replace Box<Field> with Arc<Field> within the various schema
representations
- Replace Vec<Field> with an opaque Fields type that approximates
Arc<[Arc<Field>]>, see [7] for rationale and implementation
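As a rough sketch of the shape this could take (not the implementation
in [7]), the snippet below wraps Arc<[Arc<Field>]> in a newtype. The
Field here is again a simplified, hypothetical stand-in, and the method
names new, ptr_eq and project are assumptions chosen for illustration.

```rust
use std::sync::Arc;

// Simplified, hypothetical stand-in for arrow_schema::Field.
#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

// Shared, reference-counted handle to a single field.
type FieldRef = Arc<Field>;

// Opaque wrapper approximating Arc<[Arc<Field>]>.
// Clone only bumps one reference count; no Field is copied.
#[derive(Clone, Debug, PartialEq)]
struct Fields(Arc<[FieldRef]>);

impl Fields {
    /// Wraps each field in an Arc and collects into a shared slice.
    fn new(fields: Vec<Field>) -> Self {
        Fields(fields.into_iter().map(Arc::new).collect())
    }

    /// Cheap identity check: true if both handles share one allocation.
    fn ptr_eq(&self, other: &Fields) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }

    /// Projection reuses the existing FieldRefs instead of cloning
    /// names or metadata. Panics if an index is out of bounds.
    fn project(&self, indices: &[usize]) -> Fields {
        Fields(indices.iter().map(|&i| Arc::clone(&self.0[i])).collect())
    }
}
```

Keeping the wrapper opaque leaves room to change the internal
representation later without another breaking change.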
As this is necessarily a breaking change with downstream implications, I
wanted to solicit opinions on this approach and would welcome any feedback.
Kind Regards,
Raphael
[1]: https://docs.rs/arrow-schema/latest/arrow_schema/struct.Field.html
[2]: https://docs.rs/arrow-schema/latest/arrow_schema/enum.DataType.html
[3]: https://docs.rs/arrow-schema/latest/arrow_schema/struct.Schema.html
[4]: https://github.com/apache/arrow-datafusion/issues/5157
[5]: https://github.com/influxdata/influxdb_iox/issues/5202
[6]: https://github.com/apache/arrow-rs/issues/3955
[7]: https://github.com/apache/arrow-rs/pull/3965