HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144347783
Hi @tustvold
Here are some thoughts of mine:
1. (Agree with you) We have to preserve `offset` on parent ArrayData (the
struct array) because of the validity buffer.
2. >Slice child data within StructArray when constructing boxed_fields
If what you mean is to slice child arrays in `StructArray::from`:
```rust
fn from(data: ArrayData) -> Self {
let boxed_fields = data
.child_data()
.iter()
.map(|cd| make_array(cd.slice(data.offset(), data.len())))
.collect();
Self { data, boxed_fields }
}
```
I am afraid this cannot solve the problem. Because
- Inconsistent offsets still exist. For example, given a struct array:
```
StructArray (offset = 0, length = 5, field0 = i32, field1 = bool,
validity_buffer = ...)
-- Int32Array (offset = 1, length = 5)
-- BooleanArray (offset = 2, length = 5)
```
After calling `array.slice(offset = 1, length = 3)`, we will get a struct
array like this
```
StructArray (offset = 1, length = 3, field0 = i32, field1 = bool,
validity_buffer = ...)
-- Int32Array (offset = 2, length = 3)
-- BooleanArray (offset = 3, length = 3)
```
We still have 3 `offset`s in parent array and child arrays. And it is not
clear whether we have pushed down the new offset to children.
3. My suggestion (workaround)
- Do not push down the offset to child arrays when slicing (aka, just update
the offset and null_count of parent array)
- Do not slicing child arrays when building `boxed_fields`. Because this
will produce inconsistent offsets
- Slice child arrays when you call `array.column(idx)`. This is also the way
of getting a buffer of array now:
Example: getting the offsets of binary array.
```rust
pub fn value_offsets(&self) -> &[OffsetSize] {
unsafe {
std::slice::from_raw_parts(
self.value_offsets.as_ptr().add(self.data.offset()),
self.len() + 1,
)
}
}
```
Slice when getting the child array. (API change, because we own the
`ArrayRef` now)
```rust
pub fn column(&self, pos: usize) -> ArrayRef {
self.boxed_fields[pos].slice(self.offset, self.length)
}
```
- Long term suggestion: push down `offset` and `length` to
`Buffer`(`Bitmap`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]