alamb commented on PR #7987: URL: https://github.com/apache/arrow-rs/pull/7987#issuecomment-3121763470
> What do you mean by "shifting" and "appending" sorry? The buffer already contains the value bytes by the time we know the header info, so AFAIK we only have three choices: > > 1. Guess (correctly!) beforehand how many header bytes are needed, and allocate space for them before appending the value bytes. Error-prone unless `splice` is used to replace the pre-allocated space with the actual header bytes. > 2. Directly splice in the header bytes (what this PR does). Safe, but still has to shift bytes over. > 3. Splice in a zero-byte region of the correct size to shift the bytes, and then loop back over in order to populate the region. Error-prone but doesn't need a temp vector. > > Were you referring to one of the above? Or something else? I meant specifically, https://github.com/apache/arrow-rs/pull/7987/files#diff-19c7b0b0d73ef11489af7932f49046a19ec7790896a8960add5a3ded21d5657aR1230 ( I thought I left a specific comment about this but I can't find it now 🤔 ) Basically rather than allocating a new temporary vector to create the header and then splicing those bytes in like ```rust let mut bytes_to_splice = Vec::with_capacity(header_size + 3); // .... build header // slice buffer .inner_mut() .splice(starting_offset..starting_offset, bytes_to_splice); ``` I meant avoiding that allocation by shifting the byes over in one go and then writing directly into the output buffer: ```shell // insert header_size bytes into the output, shifting existing bytes down buffer.splice(starting_offset..starting_offset+header_length, std::iter::repeat(0)); // write header directly into buffer[starting_offset], buffer[starting_offset+1], etc ``` This looks somewhat similar to what @klion26 did in > Splice with Iterator ([code here](https://github.com/klion26/arrow-rs/blob/1d594b3b4a461a44ad72ecac730cbdfc537767d4/parquet-variant/src/builder.rs#L1220-L1240)) Though in that example the header is created during the insertion My suggestion is to merge this PR as is and then we can fiddle around with potentially other optimizations as a follow on PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
