Re: [PR] GH-46128:[C++][Compute] Add CompactArray method to BinaryViewArray types [arrow]

via GitHub Tue, 01 Jul 2025 05:06:40 -0700


pitrou commented on code in PR #46229:
URL: https://github.com/apache/arrow/pull/46229#discussion_r2176990291



##########
cpp/src/arrow/array/array_binary.cc:
##########
@@ -104,7 +105,14 @@ BinaryViewArray::BinaryViewArray(std::shared_ptr<DataType> 
type, int64_t length,
   SetData(
       ArrayData::Make(std::move(type), length, std::move(buffers), null_count, 
offset));
 }
-
+Result<std::shared_ptr<Array>> BinaryViewArray::CompactArray(MemoryPool* pool) 
const {
+  std::unique_ptr<ArrayBuilder> builder;
+  ARROW_RETURN_NOT_OK(MakeBuilder(pool, type(), &builder));
+  ArraySpan span(*data());
+  // ArraySpan handles offset
+  ARROW_RETURN_NOT_OK(builder->AppendArraySlice(span, 0, span.length));

Review Comment:
   > I'm currently reviewing the implementation, and I think we should consider 
making some changes:
   > 
   >     1. **Bitmap Buffer**:
   >        It's possible to avoid copying the bitmap buffer entirely.
   
   Yes, certainly, since the array contains the same data logically (beware of 
the array `offset`, though).
   
   >     2. **View Buffer**:
   >        For the view buffer, we likely need to make a copy, since it may 
already be shared due to operations like `cast`, `arrow::Array::View`, or 
`arrow::ViewOrCopyTo`.
   >        However, I wonder if we could use `std::shared_ptr::use_count()` to 
detect whether the array is the sole owner of the view buffer and skip the copy 
in that case.
   
   We certainly should not do that. Array data is immutable in Arrow. Even if 
the `use_count` is 1, the data may be in use in another thread, or a raw 
pointer may have been obtained somewhere.
   
   >     3. **Data Buffer**:
   >        Three scenarios come to mind:
   >        3.1. The buffer is entirely empty — in such cases, we can safely 
discard it. This might occur after calling a compute function like 
`extract_regex` (note: this is a hypothetical example, as compute functions 
don’t support `StringView` yet).
   >        3.2. Only a small portion of the data buffer is used — for 
instance, if only one-tenth of the buffer is referenced by the view. In this 
case, copying the used portion to a new buffer helps prevent memory bloat.
   >        3.3. A significant portion of the buffer (e.g., 50% or more) is 
referenced — in such a case, it may be better to preserve the original data 
buffer, unless we're specifically trying to reduce memory usage.
   
   We could certainly expose a parameter to choose the ratio of unused buffer 
space above which to compact a buffer.
   
   > Additionally, there's another concern: it might be useful to expose a 
public API method to indicate whether calling `CompactArray` would be 
beneficial.
   
   Well, the decision is buffer-grained, not array-grained: some buffers might 
be worth compacting and some others not.
   
   > What do you think? Should I integrate this heuristic into `CompactArray`, 
or would it be better to split it into a separate method or even into a 
different pull request?
   
   As said above, I think it would make sense to integrate the heuristic and 
apply it at the buffer level.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-46128:[C++][Compute] Add CompactArray method to BinaryViewArray types [arrow]

Reply via email to