HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144347783

   Hi @tustvold 
   Here are some thoughts of mine:
   1. (Agree with you) We have to preserve `offset` on parent ArrayData (the 
struct array) because of the validity buffer.
   2. >Slice child data within StructArray when constructing boxed_fields
   If what you mean is to slice child arrays in `StructArray::from`:
   ```rust
       fn from(data: ArrayData) -> Self {
           let boxed_fields = data
               .child_data()
               .iter()
               .map(|cd| make_array(cd.slice(data.offset(), data.len())))
               .collect();
   
           Self { data, boxed_fields }
       }
   ```
   I am afraid this cannot solve the problem. Because
   - Inconsistent offsets still exist. For example, given a struct array:
   ```
   StructArray (offset = 0, length = 5, field0 = i32, field1 = bool, 
validity_buffer = ...)
   -- Int32Array (offset = 1, length = 5)
   -- BooleanArray (offset = 2, length = 5)
   ```
   After calling `array.slice(offset = 1, length = 3)`, we will get a struct 
array like this
   ```
   StructArray (offset = 1, length = 3, field0 = i32, field1 = bool, 
validity_buffer = ...)
   -- Int32Array (offset = 2, length = 3)
   -- BooleanArray (offset = 3, length = 3)
   ```
   We still have 3 `offset`s in parent array and child arrays. And it is not 
clear whether we have pushed down the new offset to children.
   3. My suggestion (workaround)
   - Do not push down the offset to child arrays when slicing (aka, just update 
the offset and null_count of parent array)
   - Do not slicing child arrays when building `boxed_fields`. Because this 
will produce inconsistent offsets
   - Slice child arrays when you call `array.column(idx)`. This is also the way 
of getting a buffer of array now:
   Example: getting the offsets of binary array.
   ```rust
       pub fn value_offsets(&self) -> &[OffsetSize] {
           unsafe {
               std::slice::from_raw_parts(
                   self.value_offsets.as_ptr().add(self.data.offset()),
                   self.len() + 1,
               )
           }
       }
   ```
   Slice when getting the child array. (API change, because we own the 
`ArrayRef` now)
   ```rust
       pub fn column(&self, pos: usize) -> ArrayRef {
           self.boxed_fields[pos].slice(self.offset, self.length)
       }
   ```
   - Long term suggestion: push down `offset` and `length` to 
`Buffer`(`Bitmap`).
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to