[GitHub] [arrow-rs] nevi-me commented on issue #208: flight_data_from_arrow_batch sends too much data

GitBox Mon, 13 Sep 2021 12:23:55 -0700


nevi-me commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-918505093



   @matthewmturner 
   
   > @jorgecarleitao I believe this is done in `write_generic_binary` using
   > 
   > ```
   > let first = *offsets.first().unwrap();
   > let last = *offsets.last().unwrap();
   > ```
   > 
   > and then writing to buffer based on those values.
   
   That applies to strings, lists and binaries, but the overall problem is 
described below.
   
   We write `Buffer`s to IPC, and those buffers have a length and an offset 
(almost always 0). The problem is that when we write a buffer, we have to 
determine what its correct offset and length are, and the current APIs in the 
crate can't give us that information conveniently.
   
   For example, if I have a list of i64 values:
   
   ```rust
   List:
     offset_buffer: [0, 1, 3,  6, 10] // 4 values
     child_data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
     null_buffer: [T, F, T, F, T, F, T, T, T, F]
   ```
   
   There are 3 buffers to write.
   If the list gets sliced, `list.slice(2, 1)`, we now have a list that looks 
like:
   
   ```rust
   List:
     offset_buffer: [_, _, 3, 6, _]
     child_data: [_, _, _, 4, 5, 6, _, _, _, _]
     null_buffer: [_, _, _, F, T, F, _, _, _, _]
   ```
   
   In terms of the buffers, you have:
   
   ```rust
   buffer 1: type = i32, offset = 8 (2 * 4 bytes), len = 8 (2 * 4 bytes)
   buffer 2: type = i64, offset = 24 (3 * 8 bytes), len = 24 (3 * 8 bytes) // notice 
how the offset is 3 because of the list's first offset, and the length is 3 because 
of (6 - 3) on the offsets (and the child data has 3 values)
   buffer 3: type = bool, offset = 0 (3 offsets don't cross a byte boundary), 
len = 1 byte 0b00000_010
   ```
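
   The arithmetic above can be sketched as follows. This is only an 
   illustration on plain slices, not the crate's types: `ByteRange` and 
   `sliced_list_ranges` are hypothetical names, and I am assuming `i32` offsets 
   and `i64` child values as in the example.

   ```rust
   /// Byte offset and length of the part of a buffer that a slice needs.
   #[derive(Debug, PartialEq)]
   struct ByteRange {
       offset: usize,
       len: usize,
   }

   /// Computes the byte ranges of the offsets buffer and the child values
   /// buffer for `list.slice(start, len)`.
   /// Hypothetical helper, not part of the arrow crate.
   fn sliced_list_ranges(offsets: &[i32], start: usize, len: usize) -> (ByteRange, ByteRange) {
       let offset_width = std::mem::size_of::<i32>();
       let value_width = std::mem::size_of::<i64>();

       // The list's offsets decide where the child values start and end.
       let first = offsets[start] as usize;
       let last = offsets[start + len] as usize;

       let offsets_range = ByteRange {
           offset: start * offset_width,
           // A slice of `len` list values needs `len + 1` offsets.
           len: (len + 1) * offset_width,
       };
       let values_range = ByteRange {
           offset: first * value_width,
           len: (last - first) * value_width,
       };
       (offsets_range, values_range)
   }
   ```

   With the offsets `[0, 1, 3, 6, 10]` and `slice(2, 1)`, this gives offset 8 
   and len 8 for the offsets buffer, and offset 24 and len 24 for the child 
   data, matching buffer 1 and buffer 2 above.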
   
   The root of the challenge above is the definition of 
`arrow::buffer::immutable::Buffer`:
   
   ```rust
   /// Buffer represents a contiguous memory region that can be shared with 
other buffers and across
   /// thread boundaries.
   #[derive(Clone, PartialEq, Debug)]
   pub struct Buffer {
       /// the internal byte buffer.
       data: Arc<Bytes>,
   
       /// The offset into the buffer.
       offset: usize,
   }
   ```
   
   and from the fact that the only method that sets the `offset` above is:
   
   ```rust
       pub fn slice(&self, offset: usize) -> Self {
           assert!(
               offset <= self.len(),
               "the offset of the new Buffer cannot exceed the existing length"
           );
           Self {
               data: self.data.clone(),
               offset: self.offset + offset,
           }
       }
   ```
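
   To illustrate the limitation, here is a simplified stand-in for the type. 
   `ToyBuffer` is hypothetical and stores a `Vec<u8>` instead of the crate's 
   `Bytes`; the point is only that `slice` moves the start, so the sliced 
   buffer still runs to the end of the allocation.

   ```rust
   use std::sync::Arc;

   /// Simplified stand-in for `Buffer`: shared bytes plus a start offset,
   /// with no length of its own. Hypothetical, for illustration only.
   #[derive(Clone, Debug)]
   struct ToyBuffer {
       data: Arc<Vec<u8>>,
       offset: usize,
   }

   impl ToyBuffer {
       fn new(bytes: Vec<u8>) -> Self {
           Self { data: Arc::new(bytes), offset: 0 }
       }

       /// The length is derived from the allocation, so it always extends
       /// to the end of the underlying bytes.
       fn len(&self) -> usize {
           self.data.len() - self.offset
       }

       /// Mirrors the `slice` above: only the start can be moved.
       fn slice(&self, offset: usize) -> Self {
           assert!(
               offset <= self.len(),
               "the offset of the new Buffer cannot exceed the existing length"
           );
           Self { data: self.data.clone(), offset: self.offset + offset }
       }

       fn as_slice(&self) -> &[u8] {
           &self.data[self.offset..]
       }
   }
   ```

   For the child data of the list example (ten `i64` values, 80 bytes), 
   `ToyBuffer::new(vec![0u8; 80]).slice(24)` has a length of 56 bytes, not the 
   24 bytes that the sliced list actually needs.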
   
   One of the foundations of `arrow2` is that a `Buffer` knows its offset and 
length based on its content. If a string buffer is created with "hello", "you", 
"world", a slice starting at index 2 means that the buffer knows to start at 
byte 8 (after "hello" and "you"), making the IPC process easy (@jorgecarleitao 
this is my understanding without having checked the code as I write this).
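
   As a rough sketch of that idea (this is not `arrow2`'s actual 
   implementation; `SlicedBuffer` is a hypothetical name and works on raw bytes 
   rather than typed values), a buffer that carries both an offset and a 
   length can hand out exactly the bytes a slice refers to:

   ```rust
   use std::sync::Arc;

   /// Hypothetical buffer that tracks both where it starts and how long it is.
   #[derive(Clone, Debug)]
   struct SlicedBuffer {
       data: Arc<Vec<u8>>,
       offset: usize,
       length: usize,
   }

   impl SlicedBuffer {
       fn new(bytes: Vec<u8>) -> Self {
           let length = bytes.len();
           Self { data: Arc::new(bytes), offset: 0, length }
       }

       /// Narrows the view to `length` bytes starting `offset` bytes in.
       fn slice(&self, offset: usize, length: usize) -> Self {
           assert!(offset + length <= self.length, "slice exceeds the buffer's length");
           Self { data: self.data.clone(), offset: self.offset + offset, length }
       }

       /// Exactly the bytes this buffer refers to, ready to be written out.
       fn as_slice(&self) -> &[u8] {
           &self.data[self.offset..self.offset + self.length]
       }
   }
   ```

   With the values "hello", "you", "world" stored back to back, 
   `SlicedBuffer::new(b"helloyouworld".to_vec()).slice(8, 5)` exposes only the 
   bytes of "world".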
   
   ___
   
   So, to only write the correct amount of data in IPC, my approach would be to 
modify `write_array_data()` in `arrow::ipc::writer` to account for the offset 
and correct length, and probably change `write_buffer()` in the same module to 
take the sliced bytes instead of a `Buffer`.
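
   A minimal sketch of what a bytes-based variant could look like is below. 
   `write_sliced_bytes` is a hypothetical name and signature, not the existing 
   `write_buffer()`, and the 8-byte padding is an assumption made for the 
   example.

   ```rust
   /// Appends `bytes` to `arrow_data`, zero-padded to an 8-byte boundary, and
   /// returns the running offset after the write.
   /// Hypothetical variant of `write_buffer()` that takes already-sliced bytes.
   fn write_sliced_bytes(bytes: &[u8], arrow_data: &mut Vec<u8>, offset: usize) -> usize {
       // Round the length up to the next multiple of 8 (assumed alignment).
       let padded_len = (bytes.len() + 7) / 8 * 8;
       arrow_data.extend_from_slice(bytes);
       arrow_data.resize(arrow_data.len() + (padded_len - bytes.len()), 0);
       offset + padded_len
   }
   ```

   The caller, e.g. `write_array_data()`, would then be responsible for 
   computing the range, so for the sliced list above it would pass 
   `&child_bytes[24..48]` rather than the whole child buffer.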

