Re: [PR] Improve memory usage for `arrow-row -> String/BinaryView` when utf8 validation disabled [arrow-rs]

via GitHub Wed, 16 Jul 2025 00:41:39 -0700


ding-young commented on PR #7917:
URL: https://github.com/apache/arrow-rs/pull/7917#issuecomment-3077375154


   - `cargo bench` results

   | Case (str_len, null prob) | main      | issue-6057 |
   |---------------------------|-----------|------------|
   | string view(10, 0)        | 51.23 µs  | 52.18 µs   |
   | string view(30, 0)        | 45.47 µs  | 46.63 µs   |
   | string view(100, 0)       | 64.18 µs  | 68.54 µs   |
   | string view(100, 0.5)     | 70.11 µs  | 74.06 µs   |
   | string view(1..100, 0)    | 100.72 µs | 103.80 µs  |
   | string view(1..100, 0.5)  | 80.48 µs  | 86.02 µs   |
   
   - manual memory profiling results (unit: bytes)

   I added code to read jemalloc stats (allocated, resident, active) before and
   after decoding the binary view. Memory usage improved, especially when short
   strings are mixed with large strings. When the given rows consist of only
   large strings, memory usage was the same.
   ```rust
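   // Snapshot jemalloc stats right before decoding.
   let before = jemalloc_stat();
   
   let view = if !validate_utf8 {
       decode_binary_view_inner_utf8_unchecked(rows, options)
   } else {
       decode_binary_view_inner(rows, options, validate_utf8)
   };
   
   // Snapshot again while `view` is still alive, then report the difference.
   let after = jemalloc_stat();
   // print ( after - before )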
   ```
   
   (To reproduce, see https://github.com/ding-young/arrow-rs/tree/issue-6057-bench-mem )
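
For readers without a jemalloc setup, the same before/after idea can be sketched with only the standard library. This is not the code used for the numbers in this comment: it swaps jemalloc stats for a counting wrapper around the system allocator, so it reports only requested live bytes and has no equivalent of "resident" or "active". The names `CountingAlloc`, `live_bytes`, and `measure` are hypothetical.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper around the system allocator that tracks live heap bytes.
struct CountingAlloc;

/// Bytes currently allocated and not yet freed (requested sizes, not page usage).
static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Current number of live heap bytes as seen by the wrapper.
fn live_bytes() -> usize {
    LIVE_BYTES.load(Ordering::Relaxed)
}

/// Runs `f` and returns its result together with the growth in live bytes.
/// The result is returned (not dropped) so its allocations are still counted.
fn measure<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let before = live_bytes();
    let value = f();
    let after = live_bytes();
    (value, after.saturating_sub(before))
}
```

Wrapping the decode call as `measure(|| decode(...))` and printing the second tuple element would give a rough counterpart to the "alloc" column; allocator-level stats such as jemalloc's remain the better tool for page-level effects.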
   
   | Case                      | main (alloc / active) | issue-6057 (alloc / active) |
   |---------------------------|-----------------------|-----------------------------|
   | string view(10, 0)        | **102656 / 114688**   | **65536 / 69632**           |
   | string view(30, 0)        | 196608 / 204800       | 196608 / 204800             |
   | string view(100, 0)       | 524288 / 532480       | 524288 / 532480             |
   | string view(100, 0.5)     | 294912 / 303104       | 294912 / 303104             |
   | string view(1..100, 0)    | 294912 / 303104       | 294912 / 303104             |
   | string view(1..100, 0.5)  | **180224 / 188416**   | **163840 / 172032**         |
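
To put the two tables in relative terms, the sketch below computes the percentage change from `main` to `issue-6057` for the two cases where memory usage differs. The input values are copied from the tables above; the percentages are derived here, not reported in the original measurements, and `percent_change` and `report` are hypothetical helper names.

```rust
/// Relative change from `base` to `new`, in percent (negative means a decrease).
fn percent_change(base: f64, new: f64) -> f64 {
    (new - base) / base * 100.0
}

/// (case, main, issue-6057) decode times in microseconds, from the bench table.
const DECODE_TIME_US: [(&str, f64, f64); 2] = [
    ("string view(10, 0)", 51.23, 52.18),
    ("string view(1..100, 0.5)", 80.48, 86.02),
];

/// (case, main, issue-6057) allocated bytes, from the memory table.
const ALLOCATED_BYTES: [(&str, f64, f64); 2] = [
    ("string view(10, 0)", 102_656.0, 65_536.0),
    ("string view(1..100, 0.5)", 180_224.0, 163_840.0),
];

/// Builds one summary line per case, pairing the time and memory changes.
fn report() -> Vec<String> {
    DECODE_TIME_US
        .iter()
        .zip(ALLOCATED_BYTES.iter())
        .map(|(time, mem)| {
            format!(
                "{}: time {:+.2}%, allocated {:+.2}%",
                time.0,
                percent_change(time.1, time.2),
                percent_change(mem.1, mem.2),
            )
        })
        .collect()
}
```

Printing each line of `report()` shows the trade-off side by side: a small increase in decode time against a reduction in allocated bytes for these two cases.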
   


