jorgecarleitao commented on a change in pull request #9413:
URL: https://github.com/apache/arrow/pull/9413#discussion_r575795146
##########
File path: rust/arrow/src/compute/kernels/boolean.rs
##########
@@ -288,27 +288,25 @@ where
// Align/shift left data on offset as needed, since new bitmaps are
shifted and aligned to 0 already
// NOTE: this probably only works for primitive arrays.
- let data_buffers = if left.offset() == 0 {
- left_data.buffers().to_vec()
+ let buffer = if left.offset() == 0 {
+ left_data.buffers()[0].clone()
} else {
// Shift each data buffer by type's bit_width * offset.
- left_data
- .buffers()
- .iter()
- .map(|buf| buf.slice(left.offset() * T::get_byte_width()))
- .collect::<Vec<_>>()
+ left_data.buffers()[0].slice(left.offset() * T::get_byte_width())
};
+ // UNSOUND: when `offset != 0`, the sliced buffer's `len` will be smaller
than `left.len()`
Review comment:
I was about to write that the slicing problem would not be fixed with
this, but I see that you just proposed something to address it. I (again) fully
agree with you that we need `bit_width` to fix slicing, particularly for the
Boolean Arrays.
What I ended up doing in the other repo was to stop having bits handled in
Buffers. I wrote a
[`Bitmap`](https://github.com/jorgecarleitao/arrow2/blob/proposal/src/buffer/bitmap.rs#L13)
that holds `Bytes`, and `MutableBitmap` that builds `Bytes` from bits / bool,
and both are used exclusively for bit operations (everything is measured in
bits). We can't to this here atm because all our arrays data resides in
`ArrayData`, which means that the representation of the values of a
`BooleanArray` must be a `Buffer`. This force us to use Buffers both from bits
and byte-like types, which is what you are alluding to: the buffer must know
what it contains.
fyi, the proposed implementation needs the `length` because there we are not
even storing the length on the array, and instead use the same as in values:
```rust
#[derive(Debug, Clone)]
pub struct PrimitiveArray<T: NativeType> {
data_type: DataType,
values: Buffer<T>,
validity: Option<Bitmap>,
offset: usize,
}
impl<T: NativeType> Array for PrimitiveArray<T> {
#[inline]
fn len(&self) -> usize {
self.values.len()
}
}
```
This follows a similar pattern used by `Tokio-bytes`'s `Bytes` for memory
sharing:
* `arrow::Bytes<T>` is an immutable and sharable memory region of type `T`
that knows how to deallocate itself;
* `Buffer` is a slice `[start,end]` of one of those regions
* `Bitmap` is a bit slice `[start,end]` of `arrow::Bytes<u8>`
Note how `PrimitiveArray<T: NativeType>` cannot be initialized with an
offset in my proposal: the only way to offset it is via
`PrimitiveArray::slice`, and there we offset both validity and values
consistently. This way, we never end up in an inconsistent state nor have to
ask users to pass a consistent state.
Anyway, this just to re-state that I am in full agreement with you over this
issue, and those are what led me to try out getting rid of `ArrayData`
altogether: imo it is not allowing us to use structs that are convenient for
each of the types. The problem is that changing this has a major impact in our
code-base.
If we want to go through such a path, by now I know well what to do to
migrate that to this repo, as I have been playing around on the other to
understand the risks. I am convinced we can do it, but need a good
orchestration because there are certain idioms in the current code base that we
must stop using and remove. The simpler answer is that everything that uses an
`ArrayData` must be replaced by calls to methods from the concrete
implementations. Atm, that I noticed, these are
* equal/
* transform/
* slicing
* IPC
* FFI
* Parquet
* some of the JSON reader
* some cast implementations
One proposal:
1. ground work: re-write the crate in a concrete-type specific way, i.e. we
need to basically remove `Array::data()` and `Array::data_ref()`, and replace
them by concrete types' methods such as `values()`, `offsets()`, `fields()`,
etc.
2. push the core of the proposal to a separate branch (say
`proposal-rust2`), and PR it to a second branch, say `rust2`. This will trigger
a formal discussion between us about it until we reach a consensus about it, at
which point we merge it to `rust2`.
3. migrate each component to the new API.
4. Migrate DataFusion and Parquet.
This has the same / higher risks that the parquet writer branch had, though,
but I can't really see an easier way of getting rid of `ArrayData`: our repo
depends on it all the way from IPC and FFI to the filter kernel!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]