alamb commented on issue #6906: URL: https://github.com/apache/datafusion/issues/6906#issuecomment-2355629355
# Potential Design One high level idea is to build a data structure that uses the same internal format (views/buffers) as [`StringViewArray`](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html) in Arrow: ``` ┌───────────────────────────────────────────┐ │ Stored in Vec<u8> │ │ Stored in a ┌─────────────────┐ │ │ Vec<u128> │ ┌───────────┐ │ │ │ ┌─────────────┐ ─ ─│▶│some value │ │ │ ┌─────┐ │ │ ┌─────────┐ │ │ │ └───────────┘ │ │ │ 0 │────────┼─┼▶│ View │ │ │ │ │ ├─────┤ │ │ ├─────────┤ │ │ │ │ │ │ 1 │────────┼─┼▶│ View │─│─ ─ ┐ │ ... │ │ └─────┘ │ │ └─────────┘ │ │ │ │ │ ... │ │ ... │ │ │ │ │ ┌─────┐ │ │ ┌─────────┐ │ │ │ │ │ │ N-2 │ │ │ │ View │─│─ ─ │ │ ┌────────────┐│ │ ├─────┤ │ │ ├─────────┤ │ ─ ┼ ─▶│other value ││ │ │ N-1 │────────┼─┼▶│ View │ │ │ └────────────┘│ │ └─────┘ │ │ └─────────┘ │ └─────────────────┘ │ │ └─────────────┘ String values are stored │ │ inline or in extra byte │ Logical group │ buffer │ number └───────────────────────────────────────────┘ New structure: MutableStringViewBuilder Current Min/Max value for that group stored in same format as StringViewArray ``` In this design, the current value for each group is stored in two parts (as described on [arrow docs](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html)) 1. a fixed size `u128` view 2. a variable length part with the string data As new batches are updated, each `View` is updated if necessary Benefits of design: 1. Hopefully use the same code as in arrow-rs 4. Allows Zero copy conversion to StringView / BinaryView at output 5. Use inlined values for quick min/max comparison I believe (though we will have to verify it) that the conversion from MutableStringViewBuilder to just StringArray (not StringViewArray) should also be better than the current `GroupsAccumulatorAdapter`. Both conversions need to copy the string bytes again into the packed `StringArray` format, but the `GroupsAccumulatorAdapter` also has to allocate/free owned `String`s as well ## Potential challenges I think the trickiest part of this code, other the low level code optimizations is that as min/max values are replaced, data in the variable length buffer will become "garbage" (not reachable) thus consuming more memory than necessary: ``` Stored in Vec<u8> ┌────────────────────┐ │ ┌────────────────┐ │ ┌─────────────┐ │ │prev max value 1│ │ │ ┌─────────┐ │ │ └────────────────┘ │ │ │ View │─│─ ┐ │ ... │ │ └─────────┘ │ │ ┌────────────────┐ │ │ ... │ │ │ │prev max value m│ │ │ │ │ └────────────────┘ │ │ │ │ │ ┌────────────────┐ │ │ │ ─ ─│▶│prev max value m│ │ │ │ │ └────────────────┘ │ │ │ │ ... │ │ │ └────────────────────┘ │ │ └─────────────┘ Previous min/max values are not pointed to anymore and need to be cleaned up ``` I think this means the code will need something [`GenericByteViewArray::gc`](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc) run occasionally ## Random Thoughts Thoughts: maybe this structure (MutableStringViewBuilder??) could be upstreamed eventually -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org