jorgecarleitao commented on issue #1176: URL: https://github.com/apache/arrow-rs/issues/1176#issuecomment-1432616438
This comment is not about how we merge arrow-rs and arrow2, just trying to outline the overall issues I observed in arrow-rs that resulted in arrow2. It may serve as an inspiration to find a solution that combines both. ## 1. `Buffer` is untyped `ptr::read`ing a buffer requires knowing the layout it was created from. This means that the struct `Buffer` cannot be semantically used. In other words, the information it carries is insufficient to use it - something else, usually its position in the `ArrayData` and the `DataType`, is required to use it in any way. ## 2. `Buffer` logical memory region is unknown we need to know the `offset`, `length` and physical type to get its logical memory region. This is stored in ArrayData. Whether we need to offset the buffer or not with `ArrayData` is dependent on the physical type. How we should apply `offset` is also dependent on the physical type (offsets of bitmaps are measured in bits). Any user of `ArrayData` will need to correctly apply these rules when reading any of its `Buffer`s. ## 3. `ArrayData` is untyped We need to use `DataType` (a logical type) to discern, at runtime, how to interpret its buffers and children. I.e. any physical typing is erased from `ArrayData` when it is built. https://github.com/apache/arrow-rs/issues/1799 is an idea to address this. ## 4. Efficient use of `ArrayData` requires self-referencing This is done performance reasons: since accessing an element requires runtime indirection to the buffers of `ArrayData`, which is expensive, we need to self-reference them to get a good performance. Self-referencing is considered an anti-pattern in Rust. https://github.com/apache/arrow-rs/issues/1811 is one way to address this. ## 5. `Buffer` and `MutableBuffer` layout is inconsistent with `Vec` Most Rust uses `Vec`. However, we allocate according to cache lines irrespectively of the type, resulting in a layout incompatible with `Vec`. This results in our inability to easily data from `Buffer` / `MutableBuffer` back to a `Vec`. The other way is possible due to foreign allocated capabilities of `Buffer`. # How arrow2 solves these ## Problem 1. and 2. * `Buffer` is typed * `Buffer` contains an `offset` and `length` ```rust #[derive(Clone)] pub struct Buffer<T> { /// the internal byte buffer. data: Arc<Bytes<T>>, /// The offset into the buffer. offset: usize, // the length of the buffer. Given a region `data` of N bytes, [offset..offset+length] is visible // to this buffer. length: usize, } ``` * `Bitmap` is a first-class citizen ```rust #[derive(Clone)] pub struct Bitmap { bytes: Arc<Bytes<u8>>, // both are measured in bits. They are used to bound the bitmap to a region of Bytes. offset: usize, length: usize, // this is a cache: it is computed on initialization unset_bits: usize, } #[derive(Clone)] pub struct MutableBitmap { buffer: Vec<u8>, // invariant: length.saturating_add(7) / 8 == buffer.len(); length: usize, } ``` These together are necessary and sufficient to represent all physical regions of an Arrow Array. ## Problem 3. and 4. An array is composed by the individual parts. Examples: ```rust #[derive(Clone)] pub struct PrimitiveArray<T: NativeType> { data_type: DataType, values: Buffer<T>, validity: Option<Bitmap>, } #[derive(Clone)] pub struct BooleanArray { data_type: DataType, values: Bitmap, validity: Option<Bitmap>, } #[derive(Clone)] pub struct BinaryArray<O: Offset> { data_type: DataType, offsets: OffsetsBuffer<O>, // a newtype buffer where elements are guaranteed to be in increasing order values: Buffer<u8>, validity: Option<Bitmap>, } ``` Because `Buffer` implements Deref to Target `[T]`, we can access individual values without self-referencing. ## Problem 5. Arrow2 uses `Vec` as its "native" container. In particular, it does not use `MutableBuffer` and instead uses `std::vec::Vec`. There is no "freeze"; instead, there is `Arc<Bytes<T>>` (a small variation of `Arc<Vec<T>>` to allow for foreign allocated regions). # Wrap up What I am trying to say with "remove `ArrayData`" is that the moment we expose such API to users, we expose a significant amount of footguns, almost all of them `unsafe`. In my opinion, we should strive to remove such footguns. In my opinion, this could be done by exposing all the necessary functionality around individual arrays on the arrays themselves (or their Mutable counterparts), so that, as a user, you can only use those APIs. For example, ```rust impl<T: NativeType> PrimitiveArray<T> { pub fn try_new( data_type: DataType, values: Buffer<T>, validity: Option<Bitmap>, ) -> Result<Self, Error>; } ``` Provides a complete construction of a `PrimitiveArray` in Arrow2 - `ArrayData::null_count` is the unset bits `validity`, ArrayData::offset is `values.offset()`. A simpler construct (both are `O(1)`): ```rust impl<T: NativeType> PrimitiveArray<T> { pub fn from_vec(values: Vec<T>) -> Self { Self::try_new(T::PRIMITIVE.into(), values.into(), None).unwrap() } ``` With ```rust impl<T: NativeType> PrimitiveArray<T> { #[must_use] pub fn into_inner(self) -> (DataType, Buffer<T>, Option<Bitmap>) { let Self { data_type, values, validity, } = self; (data_type, values, validity) } } ``` a user reverts to its individual parts. Because both values and validity contain the individual offsets and physical information (via generic), `Buffer` behave like `bytes::Bytes` and `Bitmap` like a normal container (e.g. `iter()` skips by `offset` and runs up to `length`). I am not saying we need to do this in arrow-rs, just pointing out that this is the appeal of arrow2 - imo it provides a very easy mental model of the data and corresponding invariants because every struct encapsulates the necessary information (via types or values) to use it regardless of its origin (e.g. using `Buffer` requires information from its parent `ArrayData`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
