tustvold opened a new issue, #1799: URL: https://github.com/apache/arrow-rs/issues/1799
**TLDR** Make the `ArrayData` layout explicit so that we can eventually push offsets down into the underlying buffers/bitmaps, instead of tracking them as a top-level concept, which has proven to be rather error-prone.

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Currently `ArrayData` is defined as follows.

```
pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The buffers for this array data. Note that depending on the array types, this
    /// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
    /// at different positions.
    buffers: Vec<Buffer>,

    /// The child(ren) of this array. Only non-empty for nested types, currently
    /// `ListArray` and `StructArray`.
    child_data: Vec<ArrayData>,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,
}
```

This is simple, but has a couple of caveats:

* It isn't clear which buffers and children are present for a specific layout type
* There is no clear path to storing `BooleanArray` values as a `Bitmap` rather than a `Buffer`, which would allow removing `offset`
* `Vec` allocations are made for just one or two elements (the C++ implementation inlines these)
* There is potential for accidentally interpreting a buffer incorrectly

**Describe the solution you'd like**

Introduce a new `ArrayDataLayout` enumeration:

```
pub enum ArrayDataLayout {
    Boolean { values: Buffer },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
    Dictionary { keys: Buffer, values: ArrayData },
    List { offsets: Buffer, elements: ArrayData },
    Struct { children: Vec<ArrayData> },
    Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
}
```

```
pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,

    /// The array data layout
    layout: ArrayDataLayout,
}
```

We could then progressively deprecate the methods that explicitly refer to buffers by index, etc.

**Describe alternatives you've considered**

We could choose not to do this.

**Additional context**

This could be seen as an evolution of @HaoYang670's proposal in https://github.com/apache/arrow-rs/issues/1640

It also relates to @jhorstmann's proposal in https://github.com/apache/arrow-rs/pull/1499#issuecomment-1096878229

It could also be seen as an interpretation of the arrow2 physical vs logical type separation.
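To make the intent a little more concrete, here is a minimal, self-contained sketch of how code might consume such a layout enum. It is not the arrow-rs API: `Buffer` is a plain byte-vector stand-in, `ArrayData` is cut down to `len` and `layout`, and the `buffers()`, `children()` and `example_list()` helpers are hypothetical names made up for illustration. The recursive variants are boxed here, which is an assumption of the sketch so that the stand-in types have a finite size.

```
/// Stand-in for arrow's `Buffer` (assumption: plain owned bytes).
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

/// Cut-down stand-in for the proposed `ArrayData`.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrayData {
    pub len: usize,
    pub layout: ArrayDataLayout,
}

/// The proposed enum; recursive variants are boxed in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum ArrayDataLayout {
    Boolean { values: Buffer },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
    Dictionary { keys: Buffer, values: Box<ArrayData> },
    List { offsets: Buffer, elements: Box<ArrayData> },
    Struct { children: Vec<ArrayData> },
    Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
}

impl ArrayDataLayout {
    /// Hypothetical helper: the buffers held directly by this layout.
    /// The match is exhaustive, so a new variant cannot be forgotten.
    pub fn buffers(&self) -> Vec<&Buffer> {
        match self {
            ArrayDataLayout::Boolean { values }
            | ArrayDataLayout::Primitive { values } => vec![values],
            ArrayDataLayout::Offsets { offsets, values } => vec![offsets, values],
            ArrayDataLayout::Dictionary { keys, .. } => vec![keys],
            ArrayDataLayout::List { offsets, .. } => vec![offsets],
            ArrayDataLayout::Struct { .. } => vec![],
            ArrayDataLayout::Union { offsets, types, .. } => {
                // `types` is always present; `offsets` only for dense unions.
                let mut out = vec![types];
                out.extend(offsets.iter());
                out
            }
        }
    }

    /// Hypothetical helper: the child arrays of this layout, if any.
    pub fn children(&self) -> Vec<&ArrayData> {
        match self {
            ArrayDataLayout::Boolean { .. }
            | ArrayDataLayout::Primitive { .. }
            | ArrayDataLayout::Offsets { .. } => vec![],
            ArrayDataLayout::Dictionary { values, .. } => vec![&**values],
            ArrayDataLayout::List { elements, .. } => vec![&**elements],
            ArrayDataLayout::Struct { children }
            | ArrayDataLayout::Union { children, .. } => children.iter().collect(),
        }
    }
}

/// Builds a two-element list over three primitive values.
/// One byte per value/offset for brevity; this is not arrow's real encoding.
pub fn example_list() -> ArrayData {
    let elements = ArrayData {
        len: 3,
        layout: ArrayDataLayout::Primitive { values: Buffer(vec![1, 2, 3]) },
    };
    ArrayData {
        len: 2,
        layout: ArrayDataLayout::List {
            offsets: Buffer(vec![0, 1, 3]),
            elements: Box::new(elements),
        },
    }
}
```

With something along these lines, a caller names `offsets` or `elements` directly rather than remembering that `buffers[0]` happens to be the offsets buffer for list types, which is the kind of misinterpretation listed in the caveats above.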