pitrou commented on code in PR #46229:
URL: https://github.com/apache/arrow/pull/46229#discussion_r2176990291
##########
cpp/src/arrow/array/array_binary.cc:
##########
@@ -104,7 +105,14 @@ BinaryViewArray::BinaryViewArray(std::shared_ptr<DataType>
type, int64_t length,
SetData(
ArrayData::Make(std::move(type), length, std::move(buffers), null_count,
offset));
}
-
+Result<std::shared_ptr<Array>> BinaryViewArray::CompactArray(MemoryPool* pool)
const {
+ std::unique_ptr<ArrayBuilder> builder;
+ ARROW_RETURN_NOT_OK(MakeBuilder(pool, type(), &builder));
+ ArraySpan span(*data());
+ // ArraySpan handles offset
+ ARROW_RETURN_NOT_OK(builder->AppendArraySlice(span, 0, span.length));
Review Comment:
> I'm currently reviewing the implementation, and I think we should consider
making some changes:
>
> 1. **Bitmap Buffer**:
> It's possible to avoid copying the bitmap buffer entirely.
Yes, certainly, since the array contains the same data logically (beware of
the array `offset`, though).
> 2. **View Buffer**:
> For the view buffer, we likely need to make a copy, since it may
already be shared due to operations like `cast`, `arrow::Array::View`, or
`arrow::ViewOrCopyTo`.
> However, I wonder if we could use `std::shared_ptr::use_count()` to
detect whether the array is the sole owner of the view buffer and skip the copy
in that case.
We certainly should not do that. Array data is immutable in Arrow. Even if
the `use_count` is 1, the data may be in use in another thread, or a raw
pointer may have been obtained somewhere.
> 3. **Data Buffer**:
> Three scenarios come to mind:
> 3.1. The buffer is entirely empty — in such cases, we can safely
discard it. This might occur after calling a compute function like
`extract_regex` (note: this is a hypothetical example, as compute functions
don’t support `StringView` yet).
> 3.2. Only a small portion of the data buffer is used — for
instance, if only one-tenth of the buffer is referenced by the view. In this
case, copying the used portion to a new buffer helps prevent memory bloat.
> 3.3. A significant portion of the buffer (e.g., 50% or more) is
referenced — in such a case, it may be better to preserve the original data
buffer, unless we're specifically trying to reduce memory usage.
We could certainly expose a parameter to choose the ratio of unused buffer
space above which to compact a buffer.
> Additionally, there's another concern: it might be useful to expose a
public API method to indicate whether calling `CompactArray` would be
beneficial.
Well, the decision is buffer-grained, not array-grained: some buffers might
be worth compacting and some others not.
> What do you think? Should I integrate this heuristic into `CompactArray`,
or would it be better to split it into a separate method or even into a
different pull request?
As said above, I think it would make sense to integrate the heuristic and
apply it at the buffer level.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]