Jefffrey commented on issue #7186:
URL: https://github.com/apache/arrow-rs/issues/7186#issuecomment-4843376782
I'm thinking along these lines:
```rust
// default = all true
struct MinifyOptions {
    // general
    minimize_all_buffers: bool,
    recursive: bool,
    // specialized
    zero_size_nulls: bool,
    compact_views: bool,
    propagate_null_mask: bool,
    deduplicate_dictionary: bool,
    compact_runs: bool,
    compact_dense_union: bool,
}
```
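A derived `Default` would set every flag to `false`, so the "default = all true" note needs a manual impl. A minimal sketch of what that could look like, with the struct repeated so the block compiles on its own (the `pub` visibility and the derives are my assumptions, not part of the proposal):

```rust
// Hypothetical sketch: the struct from the proposal, repeated for self-containment.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MinifyOptions {
    // general
    pub minimize_all_buffers: bool,
    pub recursive: bool,
    // specialized
    pub zero_size_nulls: bool,
    pub compact_views: bool,
    pub propagate_null_mask: bool,
    pub deduplicate_dictionary: bool,
    pub compact_runs: bool,
    pub compact_dense_union: bool,
}

impl Default for MinifyOptions {
    // `#[derive(Default)]` would yield all `false`, so spell out all `true`.
    fn default() -> Self {
        Self {
            minimize_all_buffers: true,
            recursive: true,
            zero_size_nulls: true,
            compact_views: true,
            propagate_null_mask: true,
            deduplicate_dictionary: true,
            compact_runs: true,
            compact_dense_union: true,
        }
    }
}
```

Callers who want to opt out of a single step could then write `MinifyOptions { minimize_all_buffers: false, ..Default::default() }`.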
### minimize_all_buffers
If any buffer the array holds (values, null buffer) is sliced, copy only the
sliced portion into a new buffer to drop the unreferenced data. As a minor
optimization, this can also change the null buffer from `Some` to `None` if
the null count is 0 (which can happen after slicing).
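To illustrate the idea, here is a rough sketch on a toy array type. `SlicedInts` is a hypothetical stand-in (a window into shared buffers, with validity stored as one `bool` per slot), not the actual arrow-rs layout, which bit-packs validity:

```rust
use std::sync::Arc;

/// Toy stand-in for a sliced primitive array: a window into shared buffers.
#[derive(Debug, Clone)]
pub struct SlicedInts {
    pub values: Arc<Vec<i32>>,
    pub validity: Option<Arc<Vec<bool>>>, // true = valid
    pub offset: usize,
    pub len: usize,
}

/// Copy only the visible window into fresh buffers, and drop the validity
/// buffer entirely when the window contains no nulls.
pub fn minimize_all_buffers(arr: &SlicedInts) -> SlicedInts {
    let range = arr.offset..arr.offset + arr.len;
    let values = Arc::new(arr.values[range.clone()].to_vec());
    let validity = arr
        .validity
        .as_ref()
        .map(|v| v[range.clone()].to_vec())
        // Keep the buffer only if at least one slot in the window is null.
        .filter(|v| v.iter().any(|&valid| !valid))
        .map(Arc::new);
    SlicedInts { values, validity, offset: 0, len: arr.len }
}
```

Note that the copy is what frees the rest of the original buffer for dropping; the original allocation only goes away once nothing else references it.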
### recursive
If the array is nested, apply the same minify options to all children. This
could get a bit tricky, but I'll get to that below.
### zero_size_nulls
For byte/list/map types, pretty much what
https://github.com/apache/arrow-rs/pull/9970 is doing.
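As a rough illustration of the general idea (not the code in that PR), here is a sketch on a toy offsets-plus-data layout. `ByteArray` is a hypothetical type with one `bool` per slot for validity:

```rust
/// Toy variable-length byte array: `offsets.len() == validity.len() + 1`.
pub struct ByteArray {
    pub offsets: Vec<usize>,
    pub data: Vec<u8>,
    pub validity: Vec<bool>, // true = valid
}

/// Rebuild offsets and data so that every null slot spans zero bytes.
pub fn zero_size_nulls(arr: &ByteArray) -> ByteArray {
    let mut offsets = Vec::with_capacity(arr.offsets.len());
    let mut data = Vec::new();
    offsets.push(0);
    for (i, &valid) in arr.validity.iter().enumerate() {
        if valid {
            // Only valid slots contribute bytes to the new data buffer.
            data.extend_from_slice(&arr.data[arr.offsets[i]..arr.offsets[i + 1]]);
        }
        // For a null slot the offset repeats, giving it zero length.
        offsets.push(data.len());
    }
    ByteArray { offsets, data, validity: arr.validity.clone() }
}
```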
### compact_views
Performing gc on list/byte view types.
### propagate_null_mask
For struct & fixed size list, for the null slots in these parent types we can
try to rebuild the children to have nulls there too, to ensure the minimum
possible memory footprint. For example if the child is a string array, we can
ensure it's null (or at least zero sized) where the parents have null slots.
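For the struct case (one child slot per parent slot), the propagation itself could look something like this. It's only a sketch: the child is modelled as a hypothetical `Vec<Option<T>>` rather than a real Arrow array, and fixed size list would need to map each child index to its parent slot first:

```rust
/// Null out every child slot whose parent slot is null (struct-style, 1:1 slots).
pub fn propagate_null_mask<T: Clone>(
    parent_validity: &[bool], // true = valid
    child: &[Option<T>],
) -> Vec<Option<T>> {
    assert_eq!(parent_validity.len(), child.len());
    parent_validity
        .iter()
        .zip(child)
        .map(|(&valid, value)| if valid { value.clone() } else { None })
        .collect()
}
```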
### deduplicate_dictionary
For dictionaries, deduplicate the values and also remove any nulls in the
values array (and instead represent them as nulls at the key level).
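A minimal sketch of the remapping, assuming keys and values are plain slices (the function name and signature are hypothetical, just for illustration). As a side effect this version also drops values that no key references:

```rust
use std::collections::HashMap;

/// Returns remapped keys plus a values vector with no duplicates and no nulls.
pub fn deduplicate_dictionary(
    keys: &[Option<usize>],
    values: &[Option<&str>],
) -> (Vec<Option<usize>>, Vec<String>) {
    let mut new_values: Vec<String> = Vec::new();
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let new_keys: Vec<Option<usize>> = keys
        .iter()
        .map(|key| {
            // A null key stays null; a key pointing at a null value becomes a null key.
            let value = (*key).and_then(|k| values[k])?;
            // Reuse the index of an already-seen value, or append a new one.
            let idx = *seen.entry(value).or_insert_with(|| {
                new_values.push(value.to_string());
                new_values.len() - 1
            });
            Some(idx)
        })
        .collect();
    (new_keys, new_values)
}
```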
### compact_runs
For run arrays, ensure we have the minimum possible number of runs (in case we
have an array with two runs of the same value in succession).
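Merging adjacent equal runs could be sketched like this, assuming cumulative run ends as in the run-end encoded layout (plain slices here rather than a real run array; the function name is hypothetical):

```rust
/// Merge consecutive runs that carry the same value.
/// `run_ends[i]` is the exclusive logical end index of run `i`.
pub fn compact_runs<T: PartialEq + Clone>(
    run_ends: &[usize],
    values: &[T],
) -> (Vec<usize>, Vec<T>) {
    assert_eq!(run_ends.len(), values.len());
    let mut out_ends: Vec<usize> = Vec::new();
    let mut out_values: Vec<T> = Vec::new();
    for (&end, value) in run_ends.iter().zip(values) {
        if out_values.last() == Some(value) {
            // Same value as the previous run: extend that run instead of adding one.
            *out_ends.last_mut().expect("ends and values stay in sync") = end;
        } else {
            out_ends.push(end);
            out_values.push(value.clone());
        }
    }
    (out_ends, out_values)
}
```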
### compact_dense_union
Ensure the child arrays & offsets are as small as possible, similar to views.
## Interaction of options
The specialized options are mutually exclusive of one another (without
considering recursion yet), since each targets different array types. The only
option that affects them all is `minimize_all_buffers`. It would be possible
to set `minimize_all_buffers: false` but still specify `zero_size_nulls: true`.
This would still have to rebuild some buffers of the variable-length types, but
not necessarily all of them (e.g. the null buffer can be copied as is), which
is why the general option specifies **all** buffers.
The recursive option is a bit tricky since some of the specialized options
need to rebuild the children anyway, so we need to ensure rebuilds happen in
the proper order. For example, when zero sizing nulls, we could zero size the
nulls first and then minify the children again (rebuilding the children twice),
or minify the children first and rely on that minification to automatically
zero size the nulls, etc.
## Overall
Personally I'm not sure how concerned we should be about the granularity of
the controls on this minification behaviour; I assume most use cases would be
fine with defaults 🤔
In terms of overall memory footprint, I think we're fine as long as the
docstring properly explains that whilst it can create an equivalent array with
a smaller memory footprint, it does not necessarily reduce overall memory
footprint since it does copy data into new buffers.