emilk commented on issue #6360: URL: https://github.com/apache/arrow-rs/issues/6360#issuecomment-2492135903
I've been thinking a bit on how to implement this. It seems to be a lot of work, but doable. Ideally we should be able to take any `ArrayRef` (`Arc<dyn Array>`) and slim it down using a `shrink_to_fit` method. The desired capacity should probably be the required size rounded up to an even multiple of 64 bytes (following the lead of `MutableBuffer::shrink_to_fit`). One think we need to ensure is to only allocate and copy memory if the desired capacity is lower than the current capacity. That is: `shrink_to_fit` should be idempotent, and fast on the second call. ## `fn shrink_to_fit(&self) -> ArrayRef` In https://github.com/apache/arrow-rs/pull/6300#issuecomment-2328507438, @tustvold wrote: > I think that would be generally useful, and would also handle the case of sliced arrays. I'd recommend something that returns a new Array, e.g. `fn shrink_to_fit(&self) -> ArrayRef`. In other words: ```rs trait Array { … /// Return a clone of self using the minimal amount of memory, /// rounded up to the nearest multiple of 64 bytes. /// /// If the array is already using a minimal amount of memory, this returns a clone, /// which is usually cheap because everything bottoms out into a [`Buffer`] which is just a ref-counted [`Bytes`]. fn shrink_to_fit(&self) -> ArrayRef; } ``` This would still require a small memory-allocation for each call (a call to `Arc::new`), but the actual data would only need copying when there is actual opportunity for slimming down (e.g. when there is extra capacity and/or when the `Array` is a slice of some bigger chunk of memory). In practice, this would look something like this: ```rs impl Array for BooleanArray { … fn shrink_to_fit(&self) -> ArrayRef { Arc::new(Self{ values: self.values.shrink_to_fit(), nulls: self.nulls.as_ref().map(|b| b.shrink_to_fit()), }} } } ``` …assuming we also implement `fn shrink_to_fit(&self) -> Self` on `BooleanBuffer`, `NullBuffer` etc, which all bottom out in implementing ```rs fn Buffer { fn shrik_to_fit(&self) -> Self { let new_capacity = bit_util::round_upto_multiple_of_64(self.len()); if self.data.capacity <= new_capacity { self.clone() // fast path } else { // allocate new memory (`Bytes`) using the `std::alloc::alloc` and copy to it } } } ``` ### Upsides You can call `.shrink_to_fit` on basically anything (all arrays and all buffers). It's all immutable by default. ### Downsides Each call will require at least one memory allocation, even in the fast case. For something like `StructArray` it will require more, since even a shallow clone of `StructArray` requires cloning https://github.com/apache/arrow-rs/blob/853626ef519bd68bfd604bf6d136f8677bedb1eb/arrow-array/src/array/struct_array.rs#L81 There is not easy way to opt-out of this, though a user could use a heuristic like: ```rs fn maybe_shrink(array: ArrayRef) -> ArrayRef { if array.get_buffer_memory_size() < 1024 { array // not worth the overhead of calling shrink_to_fit } else { array.shrink_to_fit() } } ``` Perhaps this is good enough? ## Alternative designs I don't really have any, except for the one in https://github.com/apache/arrow-rs/pull/6300 I am new to the code base though, so I might be missing something. ``` ## Request for feedback Thoughts @alamb @tustvold @teh-cmc ? If the `fn shrink_to_fit(&self) -> ArrayRef` plan sounds good, I can start implementing it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
