alamb commented on PR #6154: URL: https://github.com/apache/arrow-rs/pull/6154#issuecomment-2256644528
Thanks @chloro-pn. 🙏

We have gone back and forth on this idea while integrating StringView into DataFusion.

The `StringViewArray` has a `gc` method, but it does require an extra copy of the views: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc

In fact, @XiangpengHao used this API in https://github.com/apache/datafusion/pull/11587 to solve exactly the problem you are describing (too much unused data in the buffers).

However, I worry that the heuristic to determine when to compact the string data / buffers will not be ideal for any particular use case, and one principle of this crate is to give the user maximal control over performance.

So I would like to propose we support two different modes for filter kernels:
1. Filter only views
2. Filter the views and copy matching strings to a new buffer

@XiangpengHao I wonder if you have any thoughts to add here?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
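
To make the two proposed filter modes concrete, here is a minimal sketch using only the Rust standard library. It is a simplified model, not the arrow-rs implementation: `View`, `ViewArray`, `filter_views`, and `filter_and_compact` are hypothetical names, and the real view layout (inlined short strings, multiple buffers, null handling) is deliberately omitted.

```rust
use std::rc::Rc;

/// Simplified stand-in for a view: a slice (offset, len) into a shared buffer.
/// Assumption: real views also inline short strings; that is ignored here.
#[derive(Clone, Copy, Debug, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Hypothetical model of a string view array with a single data buffer.
#[derive(Debug)]
struct ViewArray {
    views: Vec<View>,
    buffer: Rc<Vec<u8>>,
}

impl ViewArray {
    /// Build the array by appending every string to one buffer.
    fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push(View { offset: buffer.len(), len: v.len() });
            buffer.extend_from_slice(v.as_bytes());
        }
        ViewArray { views, buffer: Rc::new(buffer) }
    }

    /// Read the i-th string through its view.
    fn value(&self, i: usize) -> &str {
        let v = self.views[i];
        std::str::from_utf8(&self.buffer[v.offset..v.offset + v.len]).unwrap()
    }

    /// Mode 1: filter only the views. The buffer is shared, not copied,
    /// so bytes of filtered-out strings stay alive.
    fn filter_views(&self, mask: &[bool]) -> Self {
        let views = self
            .views
            .iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v)
            .collect();
        ViewArray { views, buffer: Rc::clone(&self.buffer) }
    }

    /// Mode 2: filter the views and copy matching strings to a new buffer.
    fn filter_and_compact(&self, mask: &[bool]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::new();
        for (v, keep) in self.views.iter().zip(mask) {
            if *keep {
                // Rewrite the view so it points into the compacted buffer.
                views.push(View { offset: buffer.len(), len: v.len });
                buffer.extend_from_slice(&self.buffer[v.offset..v.offset + v.len]);
            }
        }
        ViewArray { views, buffer: Rc::new(buffer) }
    }
}
```

In this model, mode 1 is cheap but keeps the whole original buffer alive, while mode 2 pays for a copy and retains only the selected bytes. Exposing both lets the caller decide when compaction is worth it, rather than relying on a built-in heuristic.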
