alamb commented on code in PR #5885:
URL: https://github.com/apache/arrow-rs/pull/5885#discussion_r1639698623
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -265,6 +265,56 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
phantom: Default::default(),
}
}
+
+ /// Returns a buffer compact version of this array
+ ///
+ /// The original array will *not* be modified
+ ///
+ /// # Garbage Collection
+ ///
+ /// Before GC:
+ /// ```text
+ /// ┌──────┐
+ /// │......│
+ /// │......│
+ /// ┌────────────────────┐ ┌ ─ ─ ─ ▶ │Data1 │ Large buffer
+ /// │ View 1 │─ ─ ─ ─ │......│ with data that
+ /// ├────────────────────┤ │......│ is not referred
+ /// │ View 2 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
+ /// └────────────────────┘ │......│ View 2
+ /// │......│
+ /// 2 views, refer to │......│
+ /// small portions of a └──────┘
+ /// large buffer
+ /// ```
+ ///
+ /// After GC:
+ ///
+ /// ```text
+ /// ┌────────────────────┐ ┌─────┐ After gc, only
+ /// │ View 1 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│ data that is
+ /// ├────────────────────┤ ┌ ─ ─ ─ ▶ │Data2│ pointed to by
+ /// │ View 2 │─ ─ ─ ─ └─────┘ the views is
+ /// └────────────────────┘ left
+ ///
+ ///
+ /// 2 views
+ /// ```
+ /// This method will compact the data buffers by recreating the view array
and only include the data
+ /// that is pointed to by the views.
+ ///
+ /// Note that it will copy the array regardless of whether the original
array is compact.
+ /// Use with caution as this can be an expensive operation, only use it
when you are sure that the view
+ /// array is significantly smaller than when it is originally created,
e.g., after filtering or slicing.
+ pub fn gc(&self) -> Self {
+ let mut builder =
GenericByteViewBuilder::<T>::with_capacity(self.len());
+
+ for v in self.iter() {
+ builder.append_option(v);
Review Comment:
I guess I was imagining that special casing copying inlined views could
potentially be faster than converting to &str and then checking
But I don't think we should make any changes unless we have benchmarks
supporting that
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -265,6 +265,56 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
phantom: Default::default(),
}
}
+
+ /// Returns a buffer compact version of this array
+ ///
+ /// The original array will *not* be modified
+ ///
+ /// # Garbage Collection
+ ///
+ /// Before GC:
+ /// ```text
+ /// ┌──────┐
+ /// │......│
+ /// │......│
+ /// ┌────────────────────┐ ┌ ─ ─ ─ ▶ │Data1 │ Large buffer
+ /// │ View 1 │─ ─ ─ ─ │......│ with data that
+ /// ├────────────────────┤ │......│ is not referred
+ /// │ View 2 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
+ /// └────────────────────┘ │......│ View 2
+ /// │......│
+ /// 2 views, refer to │......│
+ /// small portions of a └──────┘
+ /// large buffer
+ /// ```
+ ///
+ /// After GC:
+ ///
+ /// ```text
+ /// ┌────────────────────┐ ┌─────┐ After gc, only
+ /// │ View 1 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│ data that is
+ /// ├────────────────────┤ ┌ ─ ─ ─ ▶ │Data2│ pointed to by
+ /// │ View 2 │─ ─ ─ ─ └─────┘ the views is
+ /// └────────────────────┘ left
+ ///
+ ///
+ /// 2 views
+ /// ```
+ /// This method will compact the data buffers by recreating the view array
and only include the data
+ /// that is pointed to by the views.
+ ///
+ /// Note that it will copy the array regardless of whether the original
array is compact.
+ /// Use with caution as this can be an expensive operation, only use it
when you are sure that the view
+ /// array is significantly smaller than when it is originally created,
e.g., after filtering or slicing.
+ pub fn gc(&self) -> Self {
+ let mut builder =
GenericByteViewBuilder::<T>::with_capacity(self.len());
+
+ for v in self.iter() {
+ builder.append_option(v);
Review Comment:
I guess I was imagining that special casing copying inlined views could
potentially be faster than converting to &str and then checking
But I don't think we should make any changes unless we have benchmarks
supporting that hypothesis
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]