XiangpengHao commented on PR #6427: URL: https://github.com/apache/arrow-rs/pull/6427#issuecomment-2494162875
> I wonder if kernels are blindly concatenating identical buffers together, instead of using something like Buffer::ptr_eq to avoid a new entry for the exact same buffer allocation?

I think so: https://github.com/apache/arrow-rs/blob/def94a839236f3b04727a07c378668c9ada807f0/arrow-data/src/transform/mod.rs#L630-L637

> Someone popped up on discord the other day reporting a StringViewArray with ~10k buffers, this would suggest something is likely off somewhere.

I have experienced this when loading string views from Parquet. If the Parquet data has 10k buffers of string data, the string view will just hold all of them. Typically we should run a filter and then [gc](https://docs.rs/arrow/latest/arrow/array/type.StringViewArray.html#method.gc) it. This is handled in DF, but when used outside DF, users might need to do something similar to this: https://github.com/apache/datafusion/blob/c0ca4b4e449e07c3bcd6f3593fa31dd31ed5e0c5/datafusion/physical-plan/src/coalesce/mod.rs#L201-L221

In other words, if a StringViewArray is constructed by us, it's very unlikely to have 10k buffers, because we grow the buffers exponentially until they reach 2MB; so 10k buffers means ~20GB of data.
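To make the ~20GB estimate concrete, here is a rough back-of-the-envelope sketch in plain Rust (no arrow dependency). It assumes the builder starts at an 8 KiB block and doubles each new block up to the 2MB cap mentioned above; the starting size and the names `STARTING_BLOCK_SIZE`, `MAX_BLOCK_SIZE`, and `total_capacity` are assumptions for illustration, not something stated in this comment.

```rust
/// Assumed starting block size (hypothetical; only the 2 MiB cap comes from the comment).
const STARTING_BLOCK_SIZE: u64 = 8 * 1024;
/// Cap on the block size: 2 MiB.
const MAX_BLOCK_SIZE: u64 = 2 * 1024 * 1024;

/// Total capacity in bytes of `n` buffers whose sizes double until they hit the cap.
fn total_capacity(n: u64) -> u64 {
    let mut size = STARTING_BLOCK_SIZE;
    let mut total = 0u64;
    for _ in 0..n {
        total += size;
        // Double the next block, but never exceed the cap.
        size = (size * 2).min(MAX_BLOCK_SIZE);
    }
    total
}

fn main() {
    let bytes = total_capacity(10_000);
    // Print the total in GiB for a quick sanity check of the estimate.
    println!("{:.2} GiB", bytes as f64 / (1u64 << 30) as f64);
}
```

Under these assumptions only the first handful of buffers are smaller than 2MB, so 10k buffers works out to roughly 10k × 2MB, which is consistent with the ~20GB figure.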
