nevi-me commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-918505093
@matthewmturner
> @jorgecarleitao I believe this is done in `write_generic_binary` using
>
> ```
> let first = *offsets.first().unwrap();
> let last = *offsets.last().unwrap();
> ```
>
> and then writing to buffer based on those values.

That applies to strings, lists and binaries, but the overall problem is described below.

We write `Buffer`s to IPC, and those buffers have a length and an offset (almost always 0). The problem is that when we write a buffer, we have to determine its correct offset and length, and the current APIs in the crate can't give us that information conveniently.
For example, if I have a list of i64 values:
```rust
List:
offset_buffer: [0, 1, 3, 6, 10] // 5 offsets = 4 list slots
child_data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
null_buffer: [T, F, T, F, T, F, T, T, T, F]
```
There are 3 buffers to write.
If the list gets sliced, `list.slice(2, 1)`, we now have a list that looks
like:
```rust
List:
offset_buffer: [_, _, 3, 6, _]
child_data: [_, _, _, 4, 5, 6, _, _, _, _]
null_buffer: [_, _, _, F, T, F, _, _, _, _]
```
In terms of the buffers, you have:
```rust
buffer 1: type = i32,  offset = 8  (2 * 4 bytes), len = 8  (2 * 4 bytes)
buffer 2: type = i64,  offset = 24 (3 * 8 bytes), len = 24 (3 * 8 bytes)
// the offset is 3 values because of the list's first visible offset (3), and the
// length is 3 values because of (6 - 3) on the offsets (the child data has 3 values)
buffer 3: type = bool, offset = 0  (3 bits don't cross a byte boundary), len = 1 byte, 0b00000_010
```
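To make that arithmetic concrete, here is a small, self-contained sketch (plain Rust, not arrow-rs internals; the function and variable names are mine) that derives those byte ranges from the slice's logical offset and length plus the list's offset buffer:

```rust
/// Illustration only, not part of arrow-rs: compute the byte ranges an IPC
/// writer would need for a sliced `List<i64>` like the example above.
fn sliced_byte_ranges(list_offset: usize, list_len: usize, offsets: &[i32]) {
    // buffer 1: the offsets. A list of `list_len` slots needs `list_len + 1`
    // i32 offsets, starting at index `list_offset`.
    let offsets_start = list_offset * std::mem::size_of::<i32>();
    let offsets_len = (list_len + 1) * std::mem::size_of::<i32>();

    // buffer 2: the i64 child values. The first and last visible offsets say
    // which child values are referenced.
    let first = offsets[list_offset] as usize;
    let last = offsets[list_offset + list_len] as usize;
    let child_start = first * std::mem::size_of::<i64>();
    let child_len = (last - first) * std::mem::size_of::<i64>();

    // buffer 3: the validity bitmap over the child values. The bit offset only
    // moves the byte offset once it crosses a byte boundary.
    let bitmap_start = first / 8;
    let bitmap_len = (first % 8 + (last - first) + 7) / 8;

    println!("offsets: byte offset {}, byte len {}", offsets_start, offsets_len);
    println!("child:   byte offset {}, byte len {}", child_start, child_len);
    println!("bitmap:  byte offset {}, byte len {}", bitmap_start, bitmap_len);
}

fn main() {
    // `list.slice(2, 1)` over offsets [0, 1, 3, 6, 10] from the example:
    // offsets -> (8, 8), child -> (24, 24), bitmap -> (0, 1)
    sliced_byte_ranges(2, 1, &[0, 1, 3, 6, 10]);
}
```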
The root of the challenge above comes from the definition of `arrow::buffer::immutable::Buffer`
```rust
/// Buffer represents a contiguous memory region that can be shared with other
/// buffers and across thread boundaries.
#[derive(Clone, PartialEq, Debug)]
pub struct Buffer {
    /// the internal byte buffer.
    data: Arc<Bytes>,
    /// The offset into the buffer.
    offset: usize,
}
```
and the fact that, as things currently stand, the only method that sets the `offset` above is
```rust
pub fn slice(&self, offset: usize) -> Self {
    assert!(
        offset <= self.len(),
        "the offset of the new Buffer cannot exceed the existing length"
    );
    Self {
        data: self.data.clone(),
        offset: self.offset + offset,
    }
}
```
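As a small demonstration of the consequence (a sketch against the `arrow` crate; behaviour and available methods may differ between versions): slicing a `Buffer` records only where the data starts, so its reported length is simply "everything after the offset", not the range an IPC writer actually needs.

```rust
use arrow::buffer::Buffer;

fn main() {
    // 80 bytes, e.g. the child values buffer of the list example (10 * i64).
    let buf = Buffer::from(vec![0u8; 80]);

    // Skip the first 3 i64 values (24 bytes).
    let sliced = buf.slice(24);

    // The slice only knows its start: its length is the remainder of the
    // allocation (80 - 24), not the 24 bytes the sliced list actually uses.
    assert_eq!(sliced.len(), 56);
}
```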
One of the foundations of `arrow2` is that a `Buffer` knows its own offset and length. If a string buffer is created with "hello", "you", "world", a slice of 2 means the buffer knows to offset by 8 bytes ("hello" + "you" = 5 + 3 bytes), making the IPC process easy (@jorgecarleitao this is my understanding without having checked the code as I write this).
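For illustration only (a hypothetical type, not arrow2's actual definition), a buffer that carries both an offset and a length can answer "which bytes should be written?" by itself:

```rust
use std::sync::Arc;

/// Hypothetical sketch: a buffer that tracks both offset and length over a
/// shared allocation. Not arrow2's real type.
struct SlicedBuffer {
    data: Arc<Vec<u8>>,
    offset: usize,
    length: usize,
}

impl SlicedBuffer {
    /// Exactly the bytes an IPC writer should serialize, with no extra
    /// bookkeeping required from the caller.
    fn as_slice(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.length]
    }

    fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.length);
        Self {
            data: self.data.clone(),
            offset: self.offset + offset,
            length,
        }
    }
}

fn main() {
    let buf = SlicedBuffer { data: Arc::new(vec![0u8; 80]), offset: 0, length: 80 };
    let s = buf.slice(24, 24);
    assert_eq!(s.as_slice().len(), 24); // both offset and length are known
}
```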
___
So, to write only the correct amount of data in IPC, my approach would be to modify `write_array_data()` in `arrow::ipc::writer` to account for the offset and correct length, and probably change `write_buffer()` in the same module to take the sliced bytes instead of a `Buffer`; a sketch of that direction follows below.
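A minimal sketch of that idea, with a hypothetical helper name (this is not the crate's current `write_buffer` signature): the caller resolves the byte range, and the writer only appends those bytes plus the 8-byte padding the Arrow IPC format requires.

```rust
/// Hypothetical helper, not arrow-rs's actual `write_buffer`: append exactly
/// the bytes the caller selected, padded to an 8-byte boundary, and return
/// how many bytes were written.
fn append_sliced_bytes(sliced: &[u8], arrow_data: &mut Vec<u8>) -> usize {
    arrow_data.extend_from_slice(sliced);
    let pad = (8 - sliced.len() % 8) % 8;
    arrow_data.extend(std::iter::repeat(0u8).take(pad));
    sliced.len() + pad
}

fn main() {
    let mut arrow_data = Vec::new();
    // e.g. the 24 bytes of child values selected for the sliced list above
    let written = append_sliced_bytes(&[0u8; 24], &mut arrow_data);
    assert_eq!(written, 24); // already 8-byte aligned, so no padding needed
}
```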