kou commented on PR #43705: URL: https://github.com/apache/arrow/pull/43705#issuecomment-2297749929
> My first thought is that statistics should be attached to `ArrayData` rather than `Array`. Otherwise `Datum` (which stores only `ArrayData`) will not have access to statistics. I thought so and the first implementation uses the approach. But @felipecrv pointed out that it indicates that `ArrowStatistics` is mutable. But we don't want to maintain `ArrowStatistics` when `ArrayData` is changed. See also this discussion: https://github.com/apache/arrow/pull/42133#discussion_r1640175130 We must call `ArrayData::GetMutableValues()` when we change `ArrayData`, right? If so, how about always resetting `ArrayData::statistics` when `ArrayData::GetMutableValues()` like the following? ```diff diff --git a/cpp/src/arrow/array/data.h b/cpp/src/arrow/array/data.h index e0508fe698..60409d1367 100644 --- a/cpp/src/arrow/array/data.h +++ b/cpp/src/arrow/array/data.h @@ -261,6 +261,7 @@ struct ARROW_EXPORT ArrayData { // Access a buffer's data as a typed C pointer template <typename T> inline T* GetMutableValues(int i, int64_t absolute_offset) { + statistics = NULLPTR; if (buffers[i]) { return reinterpret_cast<T*>(buffers[i]->mutable_data()) + absolute_offset; } else { ``` If it can prevent maintaining `ArrayStatistics` when `ArrayData` is changed, we may be able to attach `ArrayStatistics` to `ArrayData`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
