[DISCUSS][RUST]: Breaking Change to Schema Representation

Raphael Taylor-Davies Wed, 29 Mar 2023 08:49:30 -0700

Hi All,

The Rust Arrow implementation stores metadata for child arrays in a struct called Field [1]. This encodes the name, nullability, datatype, and other metadata about that array. Currently the various schema representations, such as DataType [2] and Schema [3], store uniquely owned Field, as either Vec<Field> or Box<Field>. This poses a couple of challenges:

1. Nested schema will encode the same Field, including separate allocations for the name and any metadata, in multiple redundant allocations at every level in the hierarchy
2. The above nested schema will then be duplicated for every instance of a nested array
3. Projecting or cloning schema results in large amounts of cloning of Field names and metadata
4. Looking up a Field by name requires a linear, O(n), search through a Vec<Field>
5. No cheap way to compare schema for pointer equality

Together these result in inefficient CPU and memory utilisation [4] [5], especially for tables with wide or nested schemas.
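
To illustrate points 3 and 4, here is a minimal sketch of the uniquely owned layout. It is not the arrow-schema code: Field is a simplified stand-in, and OwnedSchema, index_of, and wide_schema are hypothetical names chosen for this example.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a schema field (assumption: the real Field
/// also carries a DataType and other state, omitted here).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
    pub metadata: HashMap<String, String>,
}

/// Hypothetical schema that uniquely owns its fields as Vec<Field>.
#[derive(Debug, Clone, PartialEq)]
pub struct OwnedSchema {
    pub fields: Vec<Field>,
}

impl OwnedSchema {
    /// Looking up a field by name scans the Vec from the front.
    pub fn index_of(&self, name: &str) -> Option<usize> {
        self.fields.iter().position(|f| f.name == name)
    }
}

/// Build a schema with `n` columns named col_0, col_1, ...
pub fn wide_schema(n: usize) -> OwnedSchema {
    let fields = (0..n)
        .map(|i| Field {
            name: format!("col_{}", i),
            nullable: true,
            metadata: HashMap::new(),
        })
        .collect();
    OwnedSchema { fields }
}
```

With this layout, calling clone() on an OwnedSchema allocates a fresh String for every field name and a fresh map for every field's metadata, so the cost of a clone or projection grows with both the width and the depth of the schema.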


The proposed fix [6] for this is:

- Replace Box<Field> with Arc<Field> within the various schema representations
- Replace Vec<Field> with an opaque Fields type that approximates Arc<[Arc<Field>]>, see [7] for rationale and implementation
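
As a rough illustration of the shape this takes, here is a minimal sketch under simplifying assumptions. It is not the implementation in [7]: Field is reduced to two members, and the new, get, project, and ptr_eq helpers are hypothetical names used only for this example.

```rust
use std::sync::Arc;

/// Simplified stand-in for a schema field.
#[derive(Debug, PartialEq)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
}

/// Shared, reference-counted handle to a Field.
pub type FieldRef = Arc<Field>;

/// Opaque wrapper approximating Arc<[Arc<Field>]>.
#[derive(Debug, Clone, PartialEq)]
pub struct Fields(Arc<[FieldRef]>);

impl Fields {
    /// Wrap each Field in an Arc once, at construction time.
    pub fn new(fields: Vec<Field>) -> Self {
        Fields(fields.into_iter().map(Arc::new).collect())
    }

    /// Borrow the field at position `i`, if any.
    pub fn get(&self, i: usize) -> Option<&FieldRef> {
        self.0.get(i)
    }

    /// Projection copies pointers, not names or metadata.
    /// Panics if an index is out of bounds.
    pub fn project(&self, indices: &[usize]) -> Fields {
        Fields(indices.iter().map(|&i| Arc::clone(&self.0[i])).collect())
    }

    /// Cheap check: do both values share the same allocation?
    pub fn ptr_eq(&self, other: &Fields) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}
```

In this sketch, cloning a Fields only increments a reference count, a projection shares the underlying Field allocations with its source, and ptr_eq offers a fast path before falling back to a structural comparison.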

As this is necessarily a breaking change with downstream implications, I wanted to solicit opinions on this approach, and would welcome any feedback.


Kind Regards,

Raphael

[1]: https://docs.rs/arrow-schema/latest/arrow_schema/struct.Field.html
[2]: https://docs.rs/arrow-schema/latest/arrow_schema/enum.DataType.html
[3]: https://docs.rs/arrow-schema/latest/arrow_schema/struct.Schema.html
[4]: https://github.com/apache/arrow-datafusion/issues/5157
[5]: https://github.com/influxdata/influxdb_iox/issues/5202
[6]: https://github.com/apache/arrow-rs/issues/3955
[7]: https://github.com/apache/arrow-rs/pull/3965
