alamb opened a new issue, #5513: URL: https://github.com/apache/arrow-rs/issues/5513
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

This is part of the larger project to implement `StringViewArray` -- see https://github.com/apache/arrow-rs/issues/5374

In https://github.com/apache/arrow-rs/pull/5481 we added support for `StringViewArray` and `ByteViewArray`.

This ticket tracks adding a `gc` method to `StringViewArray` and `ByteViewArray`.

After calling `filter` or `take` on a `StringViewArray` or `ByteViewArray`, the backing variable-length buffer may be much larger than necessary to store the results.

So before GC, an array may look like the following, with significant "garbage" space:

```
                                       ┌──────┐
                                       │......│
                                       │......│
┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer
│       View 1       │─ ─ ─ ─          │......│  with data that
├────────────────────┤                 │......│ is not referred
│       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
└────────────────────┘                 │......│      View 2
                                       │......│
   2 views, refer to                   │......│
  small portions of a                  └──────┘
     large buffer
```

After GC it should look like:

```
┌────────────────────┐                 ┌─────┐    After gc, only
│       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is
├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by
│       View 2       │─ ─ ─ ─          └─────┘     the views is
└────────────────────┘                                 left

        2 views
```

**Describe the solution you'd like**

I would like to add a method called `StringViewArray::gc` (and `ByteViewArray::gc`) that will compact the backing buffer so that it holds only the data the views refer to.

I expect users of the arrow crates to invoke this function, not any of the arrow kernels themselves.

**Describe alternatives you've considered**

We could also add the `gc` functionality as its own standalone kernel (e.g. `kernels::gc`) rather than a method on the array.

**Additional context**

This GC is what is described in https://pola.rs/posts/polars-string-type/

> What I consider the biggest downside is that we have to do garbage collection. When we gather/filter from an array with allocated long strings, we might keep strings alive that are not used anymore.
> This requires us to use some heuristics to determine when we will do a garbage collection pass on the string column. And because they are heuristics, sometimes they will be unneeded.
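
To make the proposed compaction concrete, here is a minimal sketch in plain Rust with no arrow dependency. Everything in it is an assumption made for illustration: `View`, `ToyViewArray`, `example`, and `should_gc` are hypothetical names, the `(offset, len)` view is a simplification of the real view layout (which, among other things, stores short strings inline), and the "less than half referenced" threshold is just one possible heuristic of the kind the quoted post alludes to. It is not the arrow-rs API or implementation.

```rust
/// Toy stand-in for a view: a byte range inside one shared data buffer.
/// (Hypothetical layout, not the actual arrow view encoding.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Toy stand-in for a view array: the views plus the buffer they point into.
#[derive(Debug, Clone)]
struct ToyViewArray {
    views: Vec<View>,
    buffer: Vec<u8>,
}

impl ToyViewArray {
    /// Total number of bytes referenced by the views.
    fn referenced_bytes(&self) -> usize {
        self.views.iter().map(|v| v.len).sum()
    }

    /// Bytes of the value that view `i` points to.
    fn value(&self, i: usize) -> &[u8] {
        let v = self.views[i];
        &self.buffer[v.offset..v.offset + v.len]
    }

    /// Copy only the referenced bytes into a fresh buffer and rewrite each
    /// view's offset so it points into the new, compact buffer.
    fn gc(&self) -> ToyViewArray {
        let mut buffer = Vec::with_capacity(self.referenced_bytes());
        let mut views = Vec::with_capacity(self.views.len());
        for v in &self.views {
            let new_offset = buffer.len();
            buffer.extend_from_slice(&self.buffer[v.offset..v.offset + v.len]);
            views.push(View { offset: new_offset, len: v.len });
        }
        ToyViewArray { views, buffer }
    }

    /// Hypothetical heuristic: compact when less than half of the buffer
    /// is still referenced by a view.
    fn should_gc(&self) -> bool {
        self.referenced_bytes() * 2 < self.buffer.len()
    }
}

/// Build the "before" picture: two small values inside a mostly unused buffer.
fn example() -> ToyViewArray {
    let mut buffer = Vec::new();
    buffer.extend_from_slice(&[b'.'; 12]);
    let off1 = buffer.len();
    buffer.extend_from_slice(b"Data1");
    buffer.extend_from_slice(&[b'.'; 12]);
    let off2 = buffer.len();
    buffer.extend_from_slice(b"Data2");
    buffer.extend_from_slice(&[b'.'; 12]);
    ToyViewArray {
        views: vec![View { offset: off1, len: 5 }, View { offset: off2, len: 5 }],
        buffer,
    }
}
```

Calling `example().gc()` in this sketch yields an array whose buffer holds only the bytes of `Data1` followed by `Data2`, while both views still resolve to the same values. A caller could check `should_gc()` after a `filter` or `take` to decide whether the copy is worth its cost, which matches the expectation above that users, not kernels, trigger the compaction.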
