[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9413: ARROW-11492: [Rust] Replaced many of the `Array::new` by `safe` counterparts

GitBox Sun, 14 Feb 2021 03:36:36 -0800


jorgecarleitao commented on a change in pull request #9413:
URL: https://github.com/apache/arrow/pull/9413#discussion_r575795146




##########
File path: rust/arrow/src/compute/kernels/boolean.rs
##########
@@ -288,27 +288,25 @@ where
 
     // Align/shift left data on offset as needed, since new bitmaps are 
shifted and aligned to 0 already
     // NOTE: this probably only works for primitive arrays.
-    let data_buffers = if left.offset() == 0 {
-        left_data.buffers().to_vec()
+    let buffer = if left.offset() == 0 {
+        left_data.buffers()[0].clone()
     } else {
         // Shift each data buffer by type's bit_width * offset.
-        left_data
-            .buffers()
-            .iter()
-            .map(|buf| buf.slice(left.offset() * T::get_byte_width()))
-            .collect::<Vec<_>>()
+        left_data.buffers()[0].slice(left.offset() * T::get_byte_width())
     };
 
+    // UNSOUND: when `offset != 0`, the sliced buffer's `len` will be smaller 
than `left.len()`

Review comment:
       I was about to write that the slicing problem would not be fixed with 
this, but I see that you just proposed something to address it. I (again) fully 
agree with you that we need `bit_width` to fix slicing, particularly for the 
Boolean Arrays.
   
   What I ended up doing in the other repo was to stop having bits handled in 
Buffers. I wrote a 
[`Bitmap`](https://github.com/jorgecarleitao/arrow2/blob/proposal/src/buffer/bitmap.rs#L13)
 that holds `Bytes`, and `MutableBitmap` that builds `Bytes` from bits / bool, 
and both are used exclusively for bit operations (everything is measured in 
bits). We can't to this here atm because all our arrays data resides in 
`ArrayData`, which means that the representation of the values of a 
`BooleanArray` must be a `Buffer`. This force us to use Buffers both from bits 
and byte-like types, which is what you are alluding to: the buffer must know 
what it contains.
   
   fyi, the proposed implementation needs the `length` because there we are not 
even storing the length on the array, and instead use the same as in values:
   
   ```rust
   #[derive(Debug, Clone)]
   pub struct PrimitiveArray<T: NativeType> {
       data_type: DataType,
       values: Buffer<T>,
       validity: Option<Bitmap>,
       offset: usize,
   }
   
   impl<T: NativeType> Array for PrimitiveArray<T> {
       #[inline]
       fn len(&self) -> usize {
           self.values.len()
       }
   }
   ```
   
   This follows a similar pattern used by `Tokio-bytes`'s `Bytes` for memory 
sharing:
   
   * `arrow::Bytes<T>` is an immutable and sharable memory region of type `T` 
that knows how to deallocate itself;
   * `Buffer` is a slice `[start,end]` of one of those regions
   * `Bitmap` is a bit slice `[start,end]` of `arrow::Bytes<u8>`
   
   Note how `PrimitiveArray<T: NativeType>` cannot be initialized with an 
offset in my proposal: the only way to offset it is via 
`PrimitiveArray::slice`, and there we offset both validity and values 
consistently. This way, we never end up in an inconsistent state nor have to 
ask users to pass a consistent state.
   
   Anyway, this just to re-state that I am in full agreement with you over this 
issue, and those are what led me to try out getting rid of `ArrayData` 
altogether: imo it is not allowing us to use structs that are convenient for 
each of the types. The problem is that changing this has a major impact in our 
code-base.
   
   If we want to go through such a path, by now I know well what to do to 
migrate that to this repo, as I have been playing around on the other to 
understand the risks. I am convinced we can do it, but need a good 
orchestration because there are certain idioms in the current code base that we 
must stop using and remove. The simpler answer is that everything that uses an 
`ArrayData` must be replaced by calls to methods from the concrete 
implementations. Atm, that I noticed, these are
   
   * equal/
   * transform/
   * slicing
   * IPC
   * FFI
   * Parquet
   * some of the JSON reader
   * some cast implementations
   
   One proposal:
   
   1. ground work: re-write the crate in a concrete-type specific way, i.e. we 
need to basically remove `Array::data()` and `Array::data_ref()`, and replace 
them by concrete types' methods such as `values()`, `offsets()`, `fields()`, 
etc.
   2. push the core of the proposal to a separate branch (say 
`proposal-rust2`), and PR it to a second branch, say `rust2`. This will trigger 
a formal discussion between us about it until we reach a consensus about it, at 
which point we merge it to `rust2`.
   3. migrate each component to the new API.
   4. Migrate DataFusion and Parquet.
   
   This has the same / higher risks that the parquet writer branch had, though, 
but I can't really see an easier way of getting rid of `ArrayData`: our repo 
depends on it all the way from IPC and FFI to the filter kernel!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9413: ARROW-11492: [Rust] Replaced many of the `Array::new` by `safe` counterparts

Reply via email to