[GitHub] [arrow-rs] jorgecarleitao commented on issue #1176: Discussion: relationship / unification of arrow-rs and arrow2 going forward

via GitHub Wed, 15 Feb 2023 23:03:03 -0800


jorgecarleitao commented on issue #1176:
URL: https://github.com/apache/arrow-rs/issues/1176#issuecomment-1432616438


   This comment is not about how we merge arrow-rs and arrow2, just trying to 
outline the overall issues I observed in arrow-rs that resulted in arrow2. It 
may serve as an inspiration to find a solution that combines both.
   
   ## 1. `Buffer` is untyped
   
   `ptr::read`ing a buffer requires knowing the layout it was created from. 
This means that the struct `Buffer` cannot be semantically used. In other 
words, the information it carries is insufficient to use it - something else, 
usually its position in the `ArrayData` and the `DataType`, is required to use 
it in any way.
   
   ## 2. `Buffer` logical memory region is unknown
   
   we need to know the `offset`, `length` and physical type to get its logical 
memory region. This is stored in ArrayData. Whether we need to offset the 
buffer or not with `ArrayData` is dependent on the physical type. How we should 
apply `offset` is also dependent on the physical type (offsets of bitmaps are 
measured in bits). 
   
   Any user of `ArrayData` will need to correctly apply these rules when 
reading any of its `Buffer`s.
   
   ## 3. `ArrayData` is untyped 
   
   We need to use `DataType` (a logical type) to discern, at runtime, how to 
interpret its buffers and children. I.e. any physical typing is erased from 
`ArrayData` when it is built.
   
   https://github.com/apache/arrow-rs/issues/1799 is an idea to address this.
   
   ## 4. Efficient use of `ArrayData` requires self-referencing
   
   This is done performance reasons: since accessing an element requires 
runtime indirection to the buffers of `ArrayData`, which is expensive, we need 
to self-reference them to get a good performance. Self-referencing is 
considered an anti-pattern in Rust.
   
   https://github.com/apache/arrow-rs/issues/1811 is one way to address this.
   
   ## 5. `Buffer` and `MutableBuffer` layout is inconsistent with `Vec`
   
   Most Rust uses `Vec`. However, we allocate according to cache lines 
irrespectively of the type, resulting in a layout incompatible with `Vec`. This 
results in our inability to easily data from `Buffer` / `MutableBuffer` back to 
a `Vec`. The other way is possible due to foreign allocated capabilities of 
`Buffer`.
   
   # How arrow2 solves these
   
   ## Problem 1. and 2.
   
   * `Buffer` is typed
   * `Buffer` contains an `offset` and `length`
   
   ```rust
   #[derive(Clone)]
   pub struct Buffer<T> {
       /// the internal byte buffer.
       data: Arc<Bytes<T>>,
   
       /// The offset into the buffer.
       offset: usize,
   
       // the length of the buffer. Given a region `data` of N bytes, 
[offset..offset+length] is visible
       // to this buffer.
       length: usize,
   }
   ```
   
   * `Bitmap` is a first-class citizen
   
   ```rust
   #[derive(Clone)]
   pub struct Bitmap {
       bytes: Arc<Bytes<u8>>,
       // both are measured in bits. They are used to bound the bitmap to a 
region of Bytes.
       offset: usize,
       length: usize,
       // this is a cache: it is computed on initialization
       unset_bits: usize,
   }
   
   #[derive(Clone)]
   pub struct MutableBitmap {
       buffer: Vec<u8>,
       // invariant: length.saturating_add(7) / 8 == buffer.len();
       length: usize,
   }
   ```
   
   These together are necessary and sufficient to represent all physical 
regions of an Arrow Array.
   
   ## Problem 3. and 4.
   
   An array is composed by the individual parts. Examples:
   
   ```rust
   #[derive(Clone)]
   pub struct PrimitiveArray<T: NativeType> {
       data_type: DataType,
       values: Buffer<T>,
       validity: Option<Bitmap>,
   }
   
   #[derive(Clone)]
   pub struct BooleanArray {
       data_type: DataType,
       values: Bitmap,
       validity: Option<Bitmap>,
   }
   
   #[derive(Clone)]
   pub struct BinaryArray<O: Offset> {
       data_type: DataType,
       offsets: OffsetsBuffer<O>,  // a newtype buffer where elements are 
guaranteed to be in increasing order
       values: Buffer<u8>,
       validity: Option<Bitmap>,
   }
   ```
   
   Because `Buffer` implements Deref to Target `[T]`, we can access individual 
values without self-referencing.
   
   ## Problem 5.
   
   Arrow2 uses `Vec` as its "native" container. In particular, it does not use 
`MutableBuffer` and instead uses `std::vec::Vec`. There is no "freeze"; 
instead, there is `Arc<Bytes<T>>` (a small variation of `Arc<Vec<T>>` to allow 
for foreign allocated regions).
   
   # Wrap up
   
   What I am trying to say with "remove `ArrayData`" is that the moment we 
expose such API to users, we expose a significant amount of footguns, almost 
all of them `unsafe`. In my opinion, we should strive to remove such footguns. 
In my opinion, this could be done by exposing all the necessary functionality 
around individual arrays on the arrays themselves (or their Mutable 
counterparts), so that, as a user, you can only use those APIs.
   
   For example, 
   
   ```rust
   impl<T: NativeType> PrimitiveArray<T> {
     pub fn try_new(
             data_type: DataType,
             values: Buffer<T>,
             validity: Option<Bitmap>,
         ) -> Result<Self, Error>;
   }
   ```
   
   Provides a complete construction of a `PrimitiveArray` in Arrow2 - 
`ArrayData::null_count` is the unset bits `validity`, ArrayData::offset is 
`values.offset()`.
   
   A simpler construct (both are `O(1)`):
   
   ```rust
   impl<T: NativeType> PrimitiveArray<T> {
       pub fn from_vec(values: Vec<T>) -> Self {
           Self::try_new(T::PRIMITIVE.into(), values.into(), None).unwrap()
       }
   ```
   
   With
   
   ```rust
   impl<T: NativeType> PrimitiveArray<T> {
       #[must_use]
       pub fn into_inner(self) -> (DataType, Buffer<T>, Option<Bitmap>) {
           let Self {
               data_type,
               values,
               validity,
           } = self;
           (data_type, values, validity)
       }
   }
   ```
   
   a user reverts to its individual parts. Because both values and validity 
contain the individual offsets and physical information (via generic), `Buffer` 
behave like `bytes::Bytes` and `Bitmap` like a normal container (e.g. `iter()` 
skips by `offset` and runs up to `length`).
   
   I am not saying we need to do this in arrow-rs, just pointing out that this 
is the appeal of arrow2 - imo it provides a very easy mental model of the data 
and corresponding invariants because every struct encapsulates the necessary 
information (via types or values) to use it regardless of its origin (e.g. 
using `Buffer` requires information from its parent `ArrayData`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] jorgecarleitao commented on issue #1176: Discussion: relationship / unification of arrow-rs and arrow2 going forward

Reply via email to