bkietz commented on a change in pull request #8470:
URL: https://github.com/apache/arrow/pull/8470#discussion_r505708835
##########
File path: cpp/src/arrow/util/utf8.h
##########
@@ -154,13 +156,52 @@ inline bool ValidateUTF8(const uint8_t* data, int64_t size) {
     return false;
   }
-  // Validate string tail one byte at a time
+  // Check if string tail is full ASCII (common case, fast)
+  if (size >= 4) {
+    uint32_t mask1, mask2;
+    memcpy(&mask2, data + size - 4, 4);
+    memcpy(&mask1, data, 4);
+    if (ARROW_PREDICT_TRUE(((mask1 | mask2) & high_bits_32) == 0)) {
+      return true;
+    }
Review comment:
```suggestion
    uint32_t head_mask = internal::SafeLoadAs<uint32_t>(data);
    uint32_t tail_mask = internal::SafeLoadAs<uint32_t>(data + size - 4);
    if (ARROW_PREDICT_TRUE(((head_mask | tail_mask) & high_bits_32) == 0)) {
      return true;
    }
```
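For context, the overlapping head/tail check can be sketched in isolation as below. This is a minimal illustration, not the code in the PR: `LoadU32` is a hypothetical stand-in for an unaligned-safe load helper, `ShortTailIsAscii` and `kHighBits32` are names invented here, and the sketch assumes the remaining tail is between 4 and 8 bytes long, so that the two 4-byte loads together cover every byte.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for an unaligned-safe load helper (assumption,
// not the library's API): memcpy avoids undefined behavior on
// misaligned addresses and typically compiles to a single load.
inline uint32_t LoadU32(const uint8_t* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Returns true if all `size` bytes are ASCII.
// Precondition (assumed for this sketch): 4 <= size <= 8, so the first
// four bytes and the last four bytes overlap or touch and cover the span.
inline bool ShortTailIsAscii(const uint8_t* data, int64_t size) {
  constexpr uint32_t kHighBits32 = 0x80808080U;  // top bit of each byte
  const uint32_t head_mask = LoadU32(data);
  const uint32_t tail_mask = LoadU32(data + size - 4);
  // A byte is ASCII exactly when its top bit is clear; OR-ing the two
  // words lets one test cover both loads.
  return ((head_mask | tail_mask) & kHighBits32) == 0;
}
```

Because only the top bit of each byte is tested, the result does not depend on byte order. Naming the two words `head_mask` and `tail_mask` also makes it easier to see which load covers which end of the buffer than `mask1` and `mask2` do.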