[GitHub] [arrow-rs] tustvold opened a new issue, #1799: ArrayData Layout Enumeration

GitBox Mon, 06 Jun 2022 04:55:26 -0700


tustvold opened a new issue, #1799:
URL: https://github.com/apache/arrow-rs/issues/1799


   **TLDR**
   
   Make the `ArrayData` layout explicit so that we can eventually push offsets down into the underlying buffers/bitmaps, instead of tracking them as a top-level concept, which has proven to be rather error prone.
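
To make "pushing offsets down" concrete, here is a minimal sketch of a buffer that carries its own slicing state, so a container would not need a separate top-level `offset`. `SlicedBuffer` and its methods are hypothetical names made up for illustration; they are not arrow-rs types, and a real buffer would be byte-based rather than `i32`-typed.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a buffer that owns its slicing state.
/// Not the arrow-rs `Buffer`; uses `i32` elements to keep the sketch short.
#[derive(Debug, Clone)]
pub struct SlicedBuffer {
    data: Arc<Vec<i32>>, // shared backing storage, never copied on slice
    offset: usize,       // start of this view, in elements
    len: usize,          // length of this view, in elements
}

impl SlicedBuffer {
    pub fn new(values: Vec<i32>) -> Self {
        let len = values.len();
        Self { data: Arc::new(values), offset: 0, len }
    }

    /// Returns a zero-copy view; the offset is tracked here, not by the caller.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            data: Arc::clone(&self.data),
            offset: self.offset + offset,
            len,
        }
    }

    /// The elements visible through this view.
    pub fn as_slice(&self) -> &[i32] {
        &self.data[self.offset..self.offset + self.len]
    }
}
```

With this shape, slicing a slice composes inside the buffer itself, and code that reads the values never has to remember to add an array-level offset.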
   
   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Currently `ArrayData` is defined as follows.
   
   ```
   pub struct ArrayData {
       /// The data type for this array data
       data_type: DataType,
   
       /// The number of elements in this array data
       len: usize,
   
       /// The number of null elements in this array data
       null_count: usize,
   
       /// The offset into this array data, in number of items
       offset: usize,
   
       /// The buffers for this array data. Note that depending on the array types, this
       /// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
       /// at different positions.
       buffers: Vec<Buffer>,
   
       /// The child(ren) of this array. Only non-empty for nested types, currently
       /// `ListArray` and `StructArray`.
       child_data: Vec<ArrayData>,
   
       /// The null bitmap. A `None` value for this indicates all values are non-null in
       /// this array.
       null_bitmap: Option<Bitmap>,
   }
   ```
   
   This is simple, but has a couple of caveats:
   
   * It isn't clear which buffers and children are present for a specific layout type
   * There is no clear path to storing `BooleanArray` values as a `Bitmap` rather than a `Buffer`, which would allow removing `offset`
   * `Vec` allocations are made for just one or two elements (the C++ implementation inlines these)
   * There is potential for accidentally interpreting a buffer incorrectly
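
The last caveat can be sketched as follows. The types below are simplified stand-ins written for this example (they are not the arrow-rs definitions), and both helper functions are hypothetical. The point is only that with a positional `Vec<Buffer>`, the meaning of `buffers[0]` depends on the data type, and nothing in the type system stops a caller from reading the wrong one.

```rust
/// Stand-in byte buffer for illustration; not the arrow-rs `Buffer`.
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

/// Two data types are enough to show the problem.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
}

/// Simplified container that, like the current design, stores buffers by position.
pub struct ArrayData {
    pub data_type: DataType,
    pub buffers: Vec<Buffer>,
}

/// Correct lookup: the position of the values buffer depends on the data type.
pub fn values_buffer(data: &ArrayData) -> &Buffer {
    match data.data_type {
        DataType::Int32 => &data.buffers[0],
        // For variable-length data, position 0 holds the offsets.
        DataType::Utf8 => &data.buffers[1],
    }
}

/// Plausible-looking but wrong for `Utf8`: it returns the offsets buffer.
pub fn naive_values_buffer(data: &ArrayData) -> &Buffer {
    &data.buffers[0]
}
```

Both functions compile and both return a `&Buffer`, so the mistake in the second one only shows up at runtime as misinterpreted bytes.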
   
   **Describe the solution you'd like**
   
   Introduce a new `ArrayDataLayout` enumeration:
   
   ```
   pub enum ArrayDataLayout {
       Boolean { values: Buffer },
       Primitive { values: Buffer },
       Offsets { offsets: Buffer, values: Buffer },
       Dictionary { keys: Buffer, values: ArrayData },
       List { offsets: Buffer, elements: ArrayData },
       Struct { children: Vec<ArrayData> },
       Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
   }
   ```
   
   ```
   pub struct ArrayData {
       /// The data type for this array data
       data_type: DataType,
   
       /// The number of elements in this array data
       len: usize,
   
       /// The number of null elements in this array data
       null_count: usize,
   
       /// The offset into this array data, in number of items
       offset: usize,
   
       /// The null bitmap. A `None` value for this indicates all values are non-null in
       /// this array.
       null_bitmap: Option<Bitmap>,
   
       /// The array data layout
       layout: ArrayDataLayout
   }
   ```
   
   We could then progressively deprecate the methods that explicitly refer to buffers by index, and so on.
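
As a rough illustration of what could replace index-based access, the sketch below matches on named fields of the layout. It uses a stand-in `Buffer` and only three of the proposed variants to stay self-contained, and the `values` and `offsets` accessors are hypothetical names chosen for this example, not part of the proposal.

```rust
/// Stand-in byte buffer for illustration; not the arrow-rs `Buffer`.
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

/// A trimmed-down version of the proposed enum (nested variants omitted).
pub enum ArrayDataLayout {
    Boolean { values: Buffer },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
}

impl ArrayDataLayout {
    /// Hypothetical accessor: the values buffer, found by name rather than position.
    pub fn values(&self) -> &Buffer {
        match self {
            ArrayDataLayout::Boolean { values }
            | ArrayDataLayout::Primitive { values }
            | ArrayDataLayout::Offsets { values, .. } => values,
        }
    }

    /// Hypothetical accessor: the offsets buffer, present only for `Offsets` here.
    pub fn offsets(&self) -> Option<&Buffer> {
        match self {
            ArrayDataLayout::Offsets { offsets, .. } => Some(offsets),
            _ => None,
        }
    }
}
```

Because the `match` in `values` has no wildcard arm, adding a new variant forces the accessor to be revisited at compile time, which is the kind of check a positional `Vec<Buffer>` cannot provide.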
   
   **Describe alternatives you've considered**
   
   We could choose not to do this and keep the current `ArrayData` definition.
   
   **Additional context**
   
   This could be seen as an evolution of @HaoYang670's proposal in https://github.com/apache/arrow-rs/issues/1640
   
   It also relates to @jhorstmann's proposal in https://github.com/apache/arrow-rs/pull/1499#issuecomment-1096878229
   
   It could also be seen as an interpretation of the arrow2 physical vs logical type separation.


