Re: [PR] Use `Arc<[Buffer]>` instead of raw `Vec` in `GenericByteViewArray` for faster `slice` [arrow-rs]

via GitHub Fri, 22 Nov 2024 08:32:31 -0800


XiangpengHao commented on PR #6427:
URL: https://github.com/apache/arrow-rs/pull/6427#issuecomment-2494162875


   > I wonder if kernels are blindly concatenating identical buffers together, 
instead of using something like Buffer::ptr_eq to avoid a new entry for the 
exact same buffer allocation?
   
   I think so: 
https://github.com/apache/arrow-rs/blob/def94a839236f3b04727a07c378668c9ada807f0/arrow-data/src/transform/mod.rs#L630-L637
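
   To illustrate the deduplication idea from the quote, here is a minimal sketch. It is not the arrow-rs code: `Arc<[u8]>` and `Arc::ptr_eq` stand in for `Buffer` and `Buffer::ptr_eq`, and `append_dedup` is a hypothetical helper name.

```rust
use std::sync::Arc;

// Stand-in for arrow's `Buffer`: a cheaply clonable, shared byte allocation.
type Buf = Arc<[u8]>;

/// Hypothetical helper: append `incoming` to `merged`, reusing an existing
/// entry when it points at the same allocation. Returns, for each incoming
/// buffer, its index in `merged`, which is what the views would be remapped to.
fn append_dedup(merged: &mut Vec<Buf>, incoming: &[Buf]) -> Vec<usize> {
    incoming
        .iter()
        .map(|buf| {
            // Pointer comparison only: equal contents in different
            // allocations are still treated as different buffers.
            match merged.iter().position(|b| Arc::ptr_eq(b, buf)) {
                Some(idx) => idx,
                None => {
                    merged.push(Arc::clone(buf));
                    merged.len() - 1
                }
            }
        })
        .collect()
}
```

   The linear scan keeps the sketch short; with many buffers a map keyed by the allocation pointer would probably be a better fit.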
   
   >  Someone popped up on discord the other day reporting a StringViewArray 
with ~10k buffers, this would suggest something is likely off somewhere.
   
   I have experienced this when loading string views from Parquet. If the 
Parquet data has 10k buffers of string data, the string view array will just 
hold all of them. Typically we should run a filter and then 
[gc](https://docs.rs/arrow/latest/arrow/array/type.StringViewArray.html#method.gc)
 it. This is handled in DF, but users outside DF might need to do 
something similar to this: 
https://github.com/apache/datafusion/blob/c0ca4b4e449e07c3bcd6f3593fa31dd31ed5e0c5/datafusion/physical-plan/src/coalesce/mod.rs#L201-L221
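
   Conceptually, the gc step copies only the bytes that the surviving views still reference into fresh storage, so the old buffers can be dropped. A simplified, std-only model of that idea follows. It is not the arrow-rs implementation: `View` and `compact` are hypothetical names, and real views also inline strings of up to 12 bytes, which this sketch ignores.

```rust
/// Simplified view: which buffer the bytes live in, plus offset and length.
#[derive(Clone, Copy, Debug, PartialEq)]
struct View {
    buffer: usize,
    offset: usize,
    len: usize,
}

/// Copy only the bytes still referenced by `views` into one fresh buffer
/// and return that buffer together with views rewritten to point into it.
fn compact(buffers: &[Vec<u8>], views: &[View]) -> (Vec<u8>, Vec<View>) {
    let mut data: Vec<u8> = Vec::with_capacity(views.iter().map(|v| v.len).sum());
    let new_views: Vec<View> = views
        .iter()
        .map(|v| {
            let offset = data.len();
            data.extend_from_slice(&buffers[v.buffer][v.offset..v.offset + v.len]);
            // Everything now lives in the single compacted buffer (index 0).
            View { buffer: 0, offset, len: v.len }
        })
        .collect();
    (data, new_views)
}
```

   After a selective filter, `views` is short while `buffers` may still be the full 10k, which is why compacting at that point pays off.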
   
   In other words, if the StringViewArray is constructed by us, it's very 
unlikely to have 10k buffers, because we grow the buffers exponentially until 
they reach 2MB; so 10k buffers means ~20GB of data.
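
   The ~20GB estimate can be checked with a short calculation. This sketch assumes the buffer size doubles on each new allocation up to the 2MB cap; the 8KB starting size used in the usage note is an assumption for illustration, and `total_bytes` is a hypothetical helper.

```rust
/// Total bytes held by `n` buffers whose sizes start at `start` bytes and
/// double on every new buffer until they reach `cap` bytes.
fn total_bytes(n: u64, start: u64, cap: u64) -> u64 {
    let mut size = start;
    let mut total = 0;
    for _ in 0..n {
        total += size;
        size = (size * 2).min(cap);
    }
    total
}
```

   Calling `total_bytes(10_000, 8 * 1024, 2 * 1024 * 1024)` gives roughly 20GB. Only the first handful of buffers are smaller than 2MB, so the starting size barely affects the total.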


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

