Hello! On Thu, Feb 23, 2023 at 12:10:27AM +0900, u5h wrote:
> # HG changeset patch > # User Yugo Horie <u5.ho...@gmail.com> > # Date 1677077775 -32400 > # Wed Feb 22 23:56:15 2023 +0900 > # Node ID 1a9487706c6af90baf2ed770db29f689c3850721 > # Parent cffaf3f2eec8fd33605c2a37814f5ffc30371989 > core: return error when the first byte is above 0xf5 in utf-8 > > * see https://datatracker.ietf.org/doc/html/rfc3629#section-4 > Thanks for catching, this needs to be addressed. The existing code accepts invalid UTF-8 sequences with the first byte being 11111xxx (>= 0xf8), and interprets them as if the first byte was 11110xxx (as a 4-byte sequence). This is going to misinterpret future UTF-8 if it will be ever expanded to 5-byte sequences (unlikely though), and might cause other problems (including security ones, as the RFC rightfully warns, though also unlikely in this particular case). A more verbose commit log might be beneficial here, as well as the one matching nginx style for commit logs. See http://nginx.org/en/docs/contributing_changes.html for some basic tips. > diff -r cffaf3f2eec8 -r 1a9487706c6a src/core/ngx_string.c > --- a/src/core/ngx_string.c Thu Feb 02 23:38:48 2023 +0300 > +++ b/src/core/ngx_string.c Wed Feb 22 23:56:15 2023 +0900 > @@ -1364,7 +1364,7 @@ > > u = **p; > > - if (u >= 0xf0) { > + if (u < 0xf5 && u >= 0xf0) { > > u &= 0x07; > valid = 0xffff; The suggested change doesn't look correct to me. In particular, the "u == 0xf5" (as well as 0xf6 and 0xf7) case is valid in terms of the codepoint decoding, though resulting value will be in the invalid range as per the comment above the function. So it does not really need any special handling, except may be to simplify things. The interesting part here is "u >= 0xf8" (which is complimentary to "u &= 0x07" being used in "if"). Further, as the condition is written, "u >= 0xf5" is now excluded from the first "if", but will be matched by the "if (u >= 0xe0)" below, and misinterpreted as a 3-byte sequence. This looks worse than the existing behaviour. I would rather consider something like: diff --git a/src/core/ngx_string.c b/src/core/ngx_string.c --- a/src/core/ngx_string.c +++ b/src/core/ngx_string.c @@ -1364,7 +1364,12 @@ ngx_utf8_decode(u_char **p, size_t n) u = **p; - if (u >= 0xf0) { + if (u >= 0xf8) { + + (*p)++; + return 0xffffffff; + + } else if (u >= 0xf0) { u &= 0x07; valid = 0xffff; -- Maxim Dounin http://mdounin.ru/ _______________________________________________ nginx-devel mailing list nginx-devel@nginx.org https://mailman.nginx.org/mailman/listinfo/nginx-devel