neilconway commented on code in PR #22029:
URL: https://github.com/apache/datafusion/pull/22029#discussion_r3236893353


##########
datafusion/functions/src/strings.rs:
##########
@@ -627,6 +733,138 @@ impl StringViewArrayBuilder {
         self.placeholder_count += 1;
     }
 
+    /// Ensure the in-progress block has room for `length` more bytes,
+    /// flushing the current block and starting a new (doubled) one if not.
+    /// Caller must only invoke this between rows — flushing mid-row would
+    /// orphan partial row data.
+    #[inline]
+    fn ensure_long_capacity(&mut self, length: u32) {
+        let required_cap = self.in_progress.len() + length as usize;
+        if self.in_progress.capacity() < required_cap {
+            self.flush_in_progress();
+            let to_reserve = (length as usize).max(self.next_block_size() as 
usize);
+            self.in_progress.reserve(to_reserve);
+        }
+    }
+
+    /// Encode a long-form view referencing `length` bytes already written
+    /// into the in-progress block at `offset`. `prefix_bytes` is the row's
+    /// data slice (or any slice starting with the row's first 4 bytes).
+    ///
+    /// Built inline rather than going through `make_view`, which is
+    /// `[inline(never)]`.
+    #[inline]
+    fn make_long_view(&self, length: u32, offset: u32, prefix_bytes: &[u8]) -> 
u128 {
+        let buffer_index: u32 = i32::try_from(self.completed.len())
+            .expect("buffer count exceeds i32::MAX")
+            as u32;
+        ByteView {
+            length,
+            // length > 12, so prefix_bytes has at least 4 bytes.
+            prefix: u32::from_le_bytes(prefix_bytes[..4].try_into().unwrap()),
+            buffer_index,
+            offset,
+        }
+        .into()
+    }
+
+    /// Append a row whose bytes are produced by mapping each byte of `src`
+    /// through `map`, in order. Output length equals `src.len()`.
+    ///
+    /// Because output length is known up front, this is more efficient than
+    /// `append_with` when computing a byte-to-byte mapping.
+    ///
+    /// # Safety
+    ///
+    /// The bytes produced by `map` over `src.iter()`, in order, must form
+    /// valid UTF-8.
+    ///
+    /// # Panics
+    ///
+    /// Panics under the same conditions as [`Self::append_value`]: if
+    /// `src.len()`, the in-progress buffer offset, or the number of completed
+    /// buffers exceeds `i32::MAX`.
+    #[inline]
+    pub unsafe fn append_byte_map<F: FnMut(u8) -> u8>(&mut self, src: &[u8], 
mut map: F) {
+        let length: u32 =
+            i32::try_from(src.len()).expect("value length exceeds i32::MAX") 
as u32;
+        if length <= 12 {

Review Comment:
   Hmm -- we already use `make_view` in the short-string code path. For long 
strings, I found that `make_long_view` is worth it: it avoids the 13-way match 
on string length and also can be inlined (`make_view` is marked never-inline). 
Lmk if that sounds reasonable or if there's another approach you think we 
should explore.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to