[I] Improve speed of row converter by skipping utf8 checks [arrow-rs]

via GitHub Mon, 15 Jul 2024 04:09:34 -0700


alamb opened a new issue, #6058:
URL: https://github.com/apache/arrow-rs/issues/6058


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Part of https://github.com/apache/arrow-rs/issues/5374
   
   @XiangpengHao implemented optimized encoding/decoding between the row format 
and ByteView (StringView / BinaryView) arrays in 
https://github.com/apache/arrow-rs/issues/5945 
/ https://github.com/apache/arrow-rs/pull/6044
   
   That PR also adds benchmarks, so we can measure any change 🎉
   
   However, as mentioned in 
https://github.com/apache/arrow-rs/pull/6044/files#r1676804033, if we know that 
the `Row` value was created from valid UTF-8 values, re-validating UTF-8 when 
converting back to arrays is unnecessary.
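
To illustrate the cost being discussed, here is a minimal sketch using only the Rust standard library, not the arrow-row internals. The function names `decode_checked` and `decode_trusted` are hypothetical. The checked path scans every byte to validate it, while the unchecked path skips that scan and relies on the caller's guarantee that the bytes came from valid strings.

```rust
use std::str::{self, Utf8Error};

/// Decodes bytes with full UTF-8 validation (scans the whole input).
pub fn decode_checked(bytes: &[u8]) -> Result<&str, Utf8Error> {
    str::from_utf8(bytes)
}

/// Decodes bytes without validating them.
///
/// # Safety
/// `bytes` must be valid UTF-8, for example because they were
/// produced by encoding a `&str` in the first place.
pub unsafe fn decode_trusted(bytes: &[u8]) -> &str {
    // SAFETY: the caller guarantees `bytes` is valid UTF-8.
    unsafe { str::from_utf8_unchecked(bytes) }
}
```

Both functions return the same `&str` for valid input; only `decode_checked` can reject invalid bytes, which is why the unchecked variant has to be `unsafe`.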
   
   **Describe the solution you'd like**
   
   Consider an API that would allow skipping UTF-8 validation.
   
   This would need to be justified by benchmarks showing a significant 
performance improvement.
   
   **Describe alternatives you've considered**
   
   Perhaps it would be an `unsafe` option on the 
[RowConverter](https://docs.rs/arrow-row/52.1.0/arrow_row/struct.RowConverter.html)
   
   ```rust
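   // `RowConverter::new` returns a `Result`, so propagate the error with `?`
   let converter = RowConverter::new(...)?;
   
   // Safety: only decoding Rows that came from valid String arrays.
   // `with_validate_utf8` is the proposed API; it does not exist yet.
   let converter = unsafe {
     converter.with_validate_utf8(false)
   };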
   ```
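
One way such an opt-out could be shaped is sketched below. This is an assumption-laden illustration, not the arrow-row API: `StringDecoder` is a hypothetical stand-in type, and `with_validate_utf8` only mirrors the name proposed above. Validation stays on by default, and turning it off requires an `unsafe` call so the caller explicitly takes on the validity guarantee.

```rust
use std::string::FromUtf8Error;

/// Hypothetical stand-in for a row decoder; NOT the arrow-row `RowConverter`.
#[derive(Debug, Clone)]
pub struct StringDecoder {
    validate_utf8: bool,
}

impl StringDecoder {
    /// Validation is enabled by default, so safe code cannot
    /// produce a `String` containing invalid UTF-8.
    pub fn new() -> Self {
        Self { validate_utf8: true }
    }

    /// Enables or disables UTF-8 validation during decoding.
    ///
    /// # Safety
    /// If `validate` is `false`, every buffer later passed to
    /// `decode` must contain valid UTF-8.
    pub unsafe fn with_validate_utf8(mut self, validate: bool) -> Self {
        self.validate_utf8 = validate;
        self
    }

    /// Converts raw bytes into a `String`, validating only if configured to.
    pub fn decode(&self, bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
        if self.validate_utf8 {
            String::from_utf8(bytes)
        } else {
            // SAFETY: the caller promised valid UTF-8 when disabling validation.
            Ok(unsafe { String::from_utf8_unchecked(bytes) })
        }
    }
}
```

Putting `unsafe` on the builder method rather than on `decode` keeps the hot path callable from safe code, at the cost of making the safety contract apply to all later calls on that instance.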
   
   
   
   **Additional context**
   


