[I] Improve arrow-row --> StringView/BinaryView memory usage [arrow-rs]

via GitHub Mon, 15 Jul 2024 04:04:52 -0700


alamb opened a new issue, #6057:
URL: https://github.com/apache/arrow-rs/issues/6057


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Part of https://github.com/apache/arrow-rs/issues/5374
   
   @XiangpengHao implemented an optimized row format --> ByteView (StringView / 
BinaryView) encoding/decoding path in https://github.com/apache/arrow-rs/issues/5945 
/ https://github.com/apache/arrow-rs/pull/6044
   
   That PR also adds benchmarks, so we can test further changes 🎉
   
   However, as mentioned in 
https://github.com/apache/arrow-rs/pull/6044/files#r1676803119, the output array 
in https://github.com/apache/arrow-rs/pull/6044 will store both short and long 
strings, even though only the long strings are referenced by the views (short 
strings are inlined in the view itself; they are copied only so that UTF-8 
validation can be done quickly).
   
   This results in more memory being used for the output array than necessary.
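
To make the overhead concrete, here is a minimal sketch (standard library only, not arrow-rs code) that compares the buffer bytes needed when every value is copied against the bytes needed when only long values are copied. The 12-byte inline limit comes from the Arrow view layout; the function name `data_buffer_bytes` is hypothetical.

```rust
/// Values of at most this many bytes are stored inline in the 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical helper: returns `(all, long_only)`, where `all` is the number
/// of buffer bytes used if every value is copied, and `long_only` is the
/// number used if only values longer than `MAX_INLINE_LEN` are copied.
fn data_buffer_bytes(values: &[&[u8]]) -> (usize, usize) {
    let all: usize = values.iter().map(|v| v.len()).sum();
    let long_only: usize = values
        .iter()
        .filter(|v| v.len() > MAX_INLINE_LEN)
        .map(|v| v.len())
        .sum();
    (all, long_only)
}
```

The difference `all - long_only` is the memory held only for the short strings, which the views never reference.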
   
   **Describe the solution you'd like**
   
   Reduce the memory required by the output array.
   
   
   **Describe alternatives you've considered**
   One idea is to use a separate UTF-8 validation buffer for short strings, 
similar to
   
   
https://github.com/apache/arrow-rs/blob/0002b4ded7cfffbf46c85e2fac0b4f9a545d0f55/parquet/src/arrow/array_reader/byte_view_array.rs#L623-L668
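
A rough sketch of that idea, using only the standard library and hypothetical names (`validate_batch`, `decode_values`); it is an illustration under these assumptions, not the code at the link above. Long values go into the buffer that the output array keeps, short values go into a scratch buffer that exists only for validation, and each buffer is validated in one pass. Because concatenating values can hide a split multi-byte character, the sketch also checks that every value starts on a character boundary.

```rust
/// Values of at most this many bytes are stored inline in the 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// True if `buf` is valid UTF-8 and every offset in `starts` is a character
/// boundary, which together imply that each individual value is valid UTF-8.
fn validate_batch(buf: &[u8], starts: &[usize]) -> bool {
    match std::str::from_utf8(buf) {
        Ok(s) => starts.iter().all(|&o| s.is_char_boundary(o)),
        Err(_) => false,
    }
}

/// Returns the data buffer to retain (long values only), or `None` if any
/// value is not valid UTF-8. The scratch buffer is dropped on return.
fn decode_values(values: &[&[u8]]) -> Option<Vec<u8>> {
    let mut data: Vec<u8> = Vec::new(); // kept by the output array
    let mut data_starts: Vec<usize> = Vec::new();
    let mut scratch: Vec<u8> = Vec::new(); // short values, validation only
    let mut scratch_starts: Vec<usize> = Vec::new();

    for &v in values {
        let (buf, starts) = if v.len() > MAX_INLINE_LEN {
            (&mut data, &mut data_starts)
        } else {
            (&mut scratch, &mut scratch_starts)
        };
        starts.push(buf.len());
        buf.extend_from_slice(v);
    }

    if validate_batch(&data, &data_starts) && validate_batch(&scratch, &scratch_starts) {
        Some(data)
    } else {
        None
    }
}
```

With this split, the retained buffer holds only the bytes that views actually reference, while validation still runs over two large contiguous buffers rather than once per value.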
   
   


