kou commented on PR #43705:
URL: https://github.com/apache/arrow/pull/43705#issuecomment-2297749929

   > My first thought is that statistics should be attached to `ArrayData` 
rather than `Array`. Otherwise `Datum` (which stores only `ArrayData`) will not 
have access to statistics.
   
   I thought so and the first implementation uses the approach. But @felipecrv 
pointed out that it indicates that `ArrowStatistics` is mutable. But we don't 
want to maintain `ArrowStatistics` when `ArrayData` is changed.
   See also this discussion: 
https://github.com/apache/arrow/pull/42133#discussion_r1640175130
   
   We must call `ArrayData::GetMutableValues()` when we change `ArrayData`, 
right? If so, how about always resetting `ArrayData::statistics` when 
`ArrayData::GetMutableValues()` like the following? 
   
   ```diff
   diff --git a/cpp/src/arrow/array/data.h b/cpp/src/arrow/array/data.h
   index e0508fe698..60409d1367 100644
   --- a/cpp/src/arrow/array/data.h
   +++ b/cpp/src/arrow/array/data.h
   @@ -261,6 +261,7 @@ struct ARROW_EXPORT ArrayData {
      // Access a buffer's data as a typed C pointer
      template <typename T>
      inline T* GetMutableValues(int i, int64_t absolute_offset) {
   +    statistics = NULLPTR;
        if (buffers[i]) {
          return reinterpret_cast<T*>(buffers[i]->mutable_data()) + 
absolute_offset;
        } else {
   ```
   
   If it can prevent maintaining `ArrayStatistics` when `ArrayData` is changed, 
we may be able to attach `ArrayStatistics` to `ArrayData`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to