Re: [PR] [Variant] Avoid extra buffer allocation in ListBuilder [arrow-rs]

via GitHub Sat, 26 Jul 2025 05:43:53 -0700


alamb commented on PR #7987:
URL: https://github.com/apache/arrow-rs/pull/7987#issuecomment-3121763470


   > What do you mean by "shifting" and "appending" sorry? The buffer already 
contains the value bytes by the time we know the header info, so AFAIK we only 
have three choices:
   > 
   > 1. Guess (correctly!) beforehand how many header bytes are needed, and 
allocate space for them before appending the value bytes. Error-prone unless 
`splice` is used to replace the pre-allocated space with the actual header 
bytes.
   > 2. Directly splice in the header bytes (what this PR does). Safe, but 
still has to shift bytes over.
   > 3. Splice in a zero-byte region of the correct size to shift the bytes, 
and then loop back over in order to populate the region. Error-prone but 
doesn't need a temp vector.
   > 
   > Were you referring to one of the above? Or something else?
   
   I meant specifically, 
https://github.com/apache/arrow-rs/pull/7987/files#diff-19c7b0b0d73ef11489af7932f49046a19ec7790896a8960add5a3ded21d5657aR1230
 ( I thought I left a specific comment about this but I can't find it now 🤔 )
   
   Basically rather than allocating a new temporary vector to create the header 
and then splicing those bytes in like
   
   ```rust
           let mut bytes_to_splice = Vec::with_capacity(header_size + 3);
           // .... build header
           // slice 
           buffer
               .inner_mut()
               .splice(starting_offset..starting_offset, bytes_to_splice);
   ```
   
   I meant avoiding that allocation by shifting the byes over in one go and 
then writing directly into the output buffer:
   
   ```shell
           // insert header_size bytes into the output, shifting existing bytes 
down
           buffer.splice(starting_offset..starting_offset+header_length, 
std::iter::repeat(0));
          // write header directly into buffer[starting_offset], 
buffer[starting_offset+1], etc
   ```
   
   This looks somewhat similar to what @klion26 did in 
   > Splice with Iterator ([code 
here](https://github.com/klion26/arrow-rs/blob/1d594b3b4a461a44ad72ecac730cbdfc537767d4/parquet-variant/src/builder.rs#L1220-L1240))
   
   Though in that example the header is created during the insertion
   
   My suggestion is to merge this PR as is and then we can fiddle around with 
potentially other optimizations as a follow on PR
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [Variant] Avoid extra buffer allocation in ListBuilder [arrow-rs]

Reply via email to