emilk commented on issue #6360:
URL: https://github.com/apache/arrow-rs/issues/6360#issuecomment-2492135903

   I've been thinking a bit on how to implement this. It seems to be a lot of 
work, but doable.
   
   Ideally we should be able to take any `ArrayRef` (`Arc<dyn Array>`) and slim 
it down using a `shrink_to_fit` method.
   The desired capacity should probably be the required size rounded up to an 
even multiple of 64 bytes (following the lead of 
`MutableBuffer::shrink_to_fit`).
   
   One think we need to ensure is to only allocate and copy memory if the 
desired capacity is lower than the current capacity. That is: `shrink_to_fit` 
should be idempotent, and fast on the second call.
   
   ## `fn shrink_to_fit(&self) -> ArrayRef`
   In https://github.com/apache/arrow-rs/pull/6300#issuecomment-2328507438, 
@tustvold wrote:
   > I think that would be generally useful, and would also handle the case of 
sliced arrays. I'd recommend something that returns a new Array, e.g. `fn 
shrink_to_fit(&self) -> ArrayRef`.
   
   In other words:
   
   ```rs
   trait Array {
        …
       /// Return a clone of self using the minimal amount of memory,
       /// rounded up to the nearest multiple of 64 bytes.
        ///
        /// If the array is already using a minimal amount of memory, this 
returns a clone,
       /// which is usually cheap because everything bottoms out into a 
[`Buffer`] which is just a ref-counted [`Bytes`].
        fn shrink_to_fit(&self) -> ArrayRef;
   }
   ```
   
   This would still require a small memory-allocation for each call (a call to 
`Arc::new`), but the actual data would only need copying when there is actual 
opportunity for slimming down (e.g. when there is extra capacity and/or when 
the `Array` is a slice of some bigger chunk of memory).
   
   In practice, this would look something like this:
   
   ```rs
   impl Array for BooleanArray {
       …
        fn shrink_to_fit(&self) -> ArrayRef {
                Arc::new(Self{
                        values: self.values.shrink_to_fit(),
                        nulls: self.nulls.as_ref().map(|b| b.shrink_to_fit()),
                }}
        }
   }
   ```
   
   …assuming we also implement `fn shrink_to_fit(&self) -> Self` on 
`BooleanBuffer`, `NullBuffer` etc, which all bottom out in implementing 
   
   ```rs
   fn Buffer {
       fn shrik_to_fit(&self) -> Self {
           let new_capacity = bit_util::round_upto_multiple_of_64(self.len());
           if self.data.capacity <= new_capacity {
                        self.clone() // fast path
                } else {
                        // allocate new memory (`Bytes`) using the 
`std::alloc::alloc` and copy to it
                }
       }
   }
   ```
   
   ### Upsides
   You can call `.shrink_to_fit` on basically anything (all arrays and all 
buffers).
   It's all immutable by default.
   
   ### Downsides
   Each call will require at least one memory allocation, even in the fast case.
   For something like `StructArray` it will require more, since even a shallow 
clone of `StructArray` requires cloning 
https://github.com/apache/arrow-rs/blob/853626ef519bd68bfd604bf6d136f8677bedb1eb/arrow-array/src/array/struct_array.rs#L81
   
   There is not easy way to opt-out of this, though a user could use a 
heuristic like:
   
   ```rs
   fn maybe_shrink(array: ArrayRef) -> ArrayRef {
       if array.get_buffer_memory_size() < 1024 {
           array // not worth the overhead of calling shrink_to_fit
       } else {
           array.shrink_to_fit()
       }
   }
   ```
   
   Perhaps this is good enough?
   
   ## Alternative designs
   I don't really have any, except for the one in 
https://github.com/apache/arrow-rs/pull/6300
   
   I am new to the code base though, so I might be missing something.
   
   ```
   
   ## Request for feedback
   
   Thoughts @alamb @tustvold @teh-cmc ?
   
   If the `fn shrink_to_fit(&self) -> ArrayRef` plan sounds good, I can start 
implementing it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to