On Tue, 13 Jul 2021 at 01:15, John Naylor <john.nay...@enterprisedb.com> wrote: > > It seems like it would be easy to have pg_utf8_verify_one in my proposed > > pg_utf8.h header and replace the body of pg_utf8_verifychar with it. > > 0001: I went ahead and tried this for v15, and also attempted some clean-up: > > - Rename pg_utf8_verify_one to pg_utf8_verifychar_internal. > - Have pg_utf8_verifychar_internal return -1 for invalid input to match other > functions in the file. We could also do this for check_ascii, but it's not > quite the same thing, because the string could still have valid bytes in it, > just not enough to advance the pointer by the stride length. > - Remove hard-coded numbers (not wedded to this). > > - Use a call to pg_utf8_verifychar in the slow path. > - Reduce pg_utf8_verifychar to thin wrapper around > pg_utf8_verifychar_internal.
- check_ascii() seems to be used only for 64-bit chunks. So why not remove the len argument and the len <= sizeof(int64) checks inside the function. We can rename it to check_ascii64() for clarity. - I was thinking, why not have a pg_utf8_verify64() that processes 64-bit chunks (or a 32-bit version). In check_ascii(), we anyway extract a 64-bit chunk from the string. We can use the same chunk to extract the required bits from a two byte char or a 4 byte char. This way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1] etc. More importantly, we can avoid the separate continuation-char checks for each individual byte. Additionally, we can try to simplify the subsequent overlong or surrogate char checks. Something like this : int pg_utf8_verifychar_32(uint32 chunk) { int len, l; for (len = sizeof(chunk); len > 0; (len -= l), (chunk = chunk << l)) { /* Is 2-byte lead */ if ((chunk & 0xF0000000) == 0xC0000000) { l = 2; /* ....... ....... */ } /* Is 3-byte lead */ else if ((chunk & 0xF0000000) == 0xE0000000) { l = 3; if (len < l) break; /* b2 and b3 should be continuation bytes */ if ((chunk & 0x00C0C000) != 0x00808000) return sizeof(chunk) - len; switch (chunk & 0xFF200000) { /* check 3-byte overlong: 1110.0000 1001.xxxx 10xx.xxxx * i.e. (b1 == 0xE0 && b2 < 0xA0). We already know b2 is of the form * 10xx since it's a continuation char. Additionally condition b2 <= * 0x9F means it is of the form 100x.xxxx. i.e. either 1000.xxxx * or 1001.xxxx. So just verify that it is xx0x.xxxx */ case 0xE0000000: return sizeof(chunk) - len; /* check surrogate: 1110.1101 101x.xxxx 10xx.xxxx * i.e. (b1 == 0xED && b2 > 0x9F): Here, > 0x9F means either * 1010.xxxx, 1011.xxxx, 1100.xxxx, or 1110.xxxx. Last two are not * possible because b2 is a continuation char. So it has to be * first two. So just verify that it is xx1x.xxxx */ case 0xED200000: return sizeof(chunk) - len; default: ; } } /* Is 4-byte lead */ else if ((chunk & 0xF0000000) == 0xF0000000) { /* ......... */ l = 4; } else return sizeof(chunk) - len; } return sizeof(chunk) - len; }