alamb commented on code in PR #5885:
URL: https://github.com/apache/arrow-rs/pull/5885#discussion_r1639698623


##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -265,6 +265,56 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
             phantom: Default::default(),
         }
     }
+
+    /// Returns a buffer compact version of this array
+    ///
+    /// The original array will *not* be modified
+    ///
+    /// # Garbage Collection
+    ///
+    /// Before GC:
+    /// ```text
+    ///                                        ┌──────┐                 
+    ///                                        │......│                 
+    ///                                        │......│                 
+    /// ┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer  
+    /// │       View 1       │─ ─ ─ ─          │......│  with data that
+    /// ├────────────────────┤                 │......│ is not referred
+    /// │       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
+    /// └────────────────────┘                 │......│      View 2     
+    ///                                        │......│                 
+    ///    2 views, refer to                   │......│                 
+    ///   small portions of a                  └──────┘                 
+    ///      large buffer                                               
+    /// ```
+    ///                                                                
+    /// After GC:
+    ///
+    /// ```text
+    /// ┌────────────────────┐                 ┌─────┐    After gc, only
+    /// │       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is  
+    /// ├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by  
+    /// │       View 2       │─ ─ ─ ─          └─────┘     the views is  
+    /// └────────────────────┘                                 left      
+    ///                                                                  
+    ///                                                                  
+    ///         2 views                                                  
+    /// ```
+    /// This method will compact the data buffers by recreating the view array 
and only include the data
+    /// that is pointed to by the views.
+    ///
+    /// Note that it will copy the array regardless of whether the original 
array is compact.
+    /// Use with caution as this can be an expensive operation, only use it 
when you are sure that the view
+    /// array is significantly smaller than when it is originally created, 
e.g., after filtering or slicing.
+    pub fn gc(&self) -> Self {
+        let mut builder = 
GenericByteViewBuilder::<T>::with_capacity(self.len());
+
+        for v in self.iter() {
+            builder.append_option(v);

Review Comment:
   I guess I was imagining that special casing copying inlined views could 
potentially be faster than converting to &str and then checking
   
   But I don't think we should make any changes unless we have benchmarks 
supporting that



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -265,6 +265,56 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
             phantom: Default::default(),
         }
     }
+
+    /// Returns a buffer compact version of this array
+    ///
+    /// The original array will *not* be modified
+    ///
+    /// # Garbage Collection
+    ///
+    /// Before GC:
+    /// ```text
+    ///                                        ┌──────┐                 
+    ///                                        │......│                 
+    ///                                        │......│                 
+    /// ┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer  
+    /// │       View 1       │─ ─ ─ ─          │......│  with data that
+    /// ├────────────────────┤                 │......│ is not referred
+    /// │       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
+    /// └────────────────────┘                 │......│      View 2     
+    ///                                        │......│                 
+    ///    2 views, refer to                   │......│                 
+    ///   small portions of a                  └──────┘                 
+    ///      large buffer                                               
+    /// ```
+    ///                                                                
+    /// After GC:
+    ///
+    /// ```text
+    /// ┌────────────────────┐                 ┌─────┐    After gc, only
+    /// │       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is  
+    /// ├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by  
+    /// │       View 2       │─ ─ ─ ─          └─────┘     the views is  
+    /// └────────────────────┘                                 left      
+    ///                                                                  
+    ///                                                                  
+    ///         2 views                                                  
+    /// ```
+    /// This method will compact the data buffers by recreating the view array 
and only include the data
+    /// that is pointed to by the views.
+    ///
+    /// Note that it will copy the array regardless of whether the original 
array is compact.
+    /// Use with caution as this can be an expensive operation, only use it 
when you are sure that the view
+    /// array is significantly smaller than when it is originally created, 
e.g., after filtering or slicing.
+    pub fn gc(&self) -> Self {
+        let mut builder = 
GenericByteViewBuilder::<T>::with_capacity(self.len());
+
+        for v in self.iter() {
+            builder.append_option(v);

Review Comment:
   I guess I was imagining that special casing copying inlined views could 
potentially be faster than converting to &str and then checking
   
   But I don't think we should make any changes unless we have benchmarks 
supporting that hypothesis



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to