alamb commented on code in PR #6777:
URL: https://github.com/apache/arrow-rs/pull/6777#discussion_r1854769795


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -281,7 +283,28 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// - String length exceeds `u32::MAX`
     #[inline]
     pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
-        let v: &[u8] = value.as_ref().as_ref();
+        // SAFETY: the value is a valid native type (e.g. `&str` or `&[u8]`)
+        unsafe { self.append_bytes(value.as_ref().as_ref()) }
+    }
+
+    /// Appends a &str into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_str(&mut self, value: &str) {
+        // SAFETY: UTF8 bytes are valid for both string and binaary
+        unsafe { self.append_bytes(value.as_bytes()) }
+    }
+
+    /// Appends bytes into this builder.
+    ///
+    /// Safety: caller must ensure value is a valid native type (e.g. valid
+    /// UTF-8 for `StringViewType`).
+    unsafe fn append_bytes(&mut self, v: &[u8]) {

Review Comment:
   We could potentially mark this as `pub` too but I figured I would keep the 
proposed API changes small



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -281,7 +283,28 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// - String length exceeds `u32::MAX`
     #[inline]
     pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
-        let v: &[u8] = value.as_ref().as_ref();
+        // SAFETY: the value is a valid native type (e.g. `&str` or `&[u8]`)
+        unsafe { self.append_bytes(value.as_ref().as_ref()) }
+    }
+
+    /// Appends a &str into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_str(&mut self, value: &str) {

Review Comment:
   There may be a fancier way to write the generics to avoid introdicing a 
`append_str` and `append_bytes` API, but I couldn't find it. 



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -88,6 +86,9 @@ pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
     /// map `<string hash> -> <index to the views>`
     string_tracker: Option<(HashTable<usize>, ahash::RandomState)>,
     phantom: PhantomData<T>,
+    /// temporary space used by [`ByteViewFormatter`]
+    /// to avoid extra allocations
+    temp_buffer: Option<String>,

Review Comment:
   This implementation allocates a single String which is used for all `write!` 
calls to the same StringViewBuilder, which should keep allocations to a mininum.
   
   However, this implementation will copy the strings **twice**:
   1. once into the temp buffer `String`
   2. once into the final view
   
   For short strings (less than 12 bytes) this copy is likely required anyways 
(unless we are manipulating the view buffer directly)
   
   For longer strings (greater than 12 bytes), we could avoid the extra copy 
into the `String` buffer by writing directly into the underlying buffer. 
   
   I actually tried to avoid the second copy, but found it was quite complex 
due to:
   1. Having to respect the  string_tracker (deduplication) 
   2. handle the block size re-allocation correctly
   
   Thus I went with the simple approach here which should get most of the 
benefits, and figured we can optimize further if needed.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to