Thanks reviewing! I agree with your early return strategy and I would reconsider that condition below.
# HG changeset patch # User Yugo Horie <u5.ho...@gmail.com> # Date 1677107390 -32400 # Thu Feb 23 08:09:50 2023 +0900 # Node ID a3ca45d39fcfd32ca92a6bd25ec18b6359b90f1a # Parent f4653576ffcd286bed7229e18ee30ec3c713b4de Core: restrict the rule of utf-8 decode. The first byte being above 0xf8 which is referred to 5byte over length older utf-8 becomes invalid. Even the range of the first byte from 0xf5 to 0xf7 is valid in the term of the codepoint decoding. See https://datatracker.ietf.org/doc/html/rfc3629#section-4. diff -r f4653576ffcd -r a3ca45d39fcf src/core/ngx_string.c --- a/src/core/ngx_string.c Thu Feb 23 07:56:44 2023 +0900 +++ b/src/core/ngx_string.c Thu Feb 23 08:09:50 2023 +0900 @@ -1363,8 +1363,12 @@ uint32_t u, i, valid; u = **p; - - if (u >= 0xf0) { + if (u >= 0xf8) { + + (*p)++; + return 0xffffffff; + + } else if (u >= 0xf0) { u &= 0x07; valid = 0xffff; On Thu, Feb 23, 2023 at 6:41 Maxim Dounin <mdou...@mdounin.ru> wrote: > Hello! > > On Thu, Feb 23, 2023 at 12:10:27AM +0900, u5h wrote: > > > # HG changeset patch > > # User Yugo Horie <u5.ho...@gmail.com> > > # Date 1677077775 -32400 > > # Wed Feb 22 23:56:15 2023 +0900 > > # Node ID 1a9487706c6af90baf2ed770db29f689c3850721 > > # Parent cffaf3f2eec8fd33605c2a37814f5ffc30371989 > > core: return error when the first byte is above 0xf5 in utf-8 > > > > * see https://datatracker.ietf.org/doc/html/rfc3629#section-4 > > > > Thanks for catching, this needs to be addressed. > > The existing code accepts invalid UTF-8 sequences with the first > byte being 11111xxx (>= 0xf8), and interprets them as if the first > byte was 11110xxx (as a 4-byte sequence). This is going to > misinterpret future UTF-8 if it will be ever expanded to 5-byte > sequences (unlikely though), and might cause other problems > (including security ones, as the RFC rightfully warns, though also > unlikely in this particular case). > > A more verbose commit log might be beneficial here, as well as > the one matching nginx style for commit logs. See > http://nginx.org/en/docs/contributing_changes.html for some basic > tips. > > > diff -r cffaf3f2eec8 -r 1a9487706c6a src/core/ngx_string.c > > --- a/src/core/ngx_string.c Thu Feb 02 23:38:48 2023 +0300 > > +++ b/src/core/ngx_string.c Wed Feb 22 23:56:15 2023 +0900 > > @@ -1364,7 +1364,7 @@ > > > > u = **p; > > > > - if (u >= 0xf0) { > > + if (u < 0xf5 && u >= 0xf0) { > > > > u &= 0x07; > > valid = 0xffff; > > The suggested change doesn't look correct to me. > > In particular, the "u == 0xf5" (as well as 0xf6 and 0xf7) case is > valid in terms of the codepoint decoding, though resulting value > will be in the invalid range as per the comment above the > function. So it does not really need any special handling, except > may be to simplify things. The interesting part here is "u >= 0xf8" > (which is complimentary to "u &= 0x07" being used in "if"). > > Further, as the condition is written, "u >= 0xf5" is now excluded > from the first "if", but will be matched by the "if (u >= 0xe0)" > below, and misinterpreted as a 3-byte sequence. This looks worse > than the existing behaviour. > > I would rather consider something like: > > diff --git a/src/core/ngx_string.c b/src/core/ngx_string.c > --- a/src/core/ngx_string.c > +++ b/src/core/ngx_string.c > @@ -1364,7 +1364,12 @@ ngx_utf8_decode(u_char **p, size_t n) > > u = **p; > > - if (u >= 0xf0) { > + if (u >= 0xf8) { > + > + (*p)++; > + return 0xffffffff; > + > + } else if (u >= 0xf0) { > > u &= 0x07; > valid = 0xffff; > > > -- > Maxim Dounin > http://mdounin.ru/ > _______________________________________________ > nginx-devel mailing list > nginx-devel@nginx.org > https://mailman.nginx.org/mailman/listinfo/nginx-devel >
_______________________________________________ nginx-devel mailing list nginx-devel@nginx.org https://mailman.nginx.org/mailman/listinfo/nginx-devel