In perl.git, the branch blead has been updated <https://perl5.git.perl.org/perl.git/commitdiff/ee0ff0f58536ba7975a4b8f1d21309ae9f451df7?hp=a578d0f3e37a8500429796cdeeba96dbba029778>
- Log -----------------------------------------------------------------
commit ee0ff0f58536ba7975a4b8f1d21309ae9f451df7
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 9 10:02:31 2019 -0600

    Add UTF8_CHK_SKIP() macro

    This is a safer version of UTF8SKIP for use when the input could be
    possibly malformed.  It uses strnlen() to not read past a NUL in the
    input.  Since Perl adds NULs to the end of SV's, this will likely
    prevent reading beyond the end of a buffer.

    A still safer version could be written that doesn't look for just a
    NUL, but any unexpected byte, and stops just before that.  I suspect
    that is overkill, and since strnlen() can be very fast, I went with
    this approach instead.  Nothing precludes adding another version that
    does this full checking

commit a281f16cacceabade4e75fbbbeb567285d462ba0
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 9 10:01:32 2019 -0600

    Document UTF8_SKIP()

commit bd350c85f2b40fbbcd57c61670e9aff330675586
Author: Karl Williamson <[email protected]>
Date:   Wed Oct 9 09:39:27 2019 -0600

    Fix pod entry for toLOWER_utf8

    It was missing a parameter

-----------------------------------------------------------------------

Summary of changes:
 handy.h |  2 +-
 utf8.h  | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/handy.h b/handy.h
index 8349fd1699..e89b43449d 100644
--- a/handy.h
+++ b/handy.h
@@ -1153,7 +1153,7 @@ The first code point of the lowercased version is returned (but note, as
 explained at L<the top of this section|/Character case changing>, that there
 may be more).
 
-=for apidoc Am|UV|toLOWER_utf8|U8* p|U8* s|STRLEN* lenp
+=for apidoc Am|UV|toLOWER_utf8|U8* p|U8* e|U8* s|STRLEN* lenp
 Converts the first UTF-8 encoded character in the sequence starting at C<p> and
 extending no further than S<C<e - 1>> to its lowercase version, and stores that
 in UTF-8 in C<s>, and its length in bytes in C<lenp>.  Note
diff --git a/utf8.h b/utf8.h
index 889324e587..83cccf16c3 100644
--- a/utf8.h
+++ b/utf8.h
@@ -530,15 +530,61 @@ encoded as UTF-8.  C<cp> is a native (ASCII or EBCDIC) code point if less than
 
 /*
 =for apidoc Am|STRLEN|UTF8SKIP|char* s
-returns the number of bytes in the UTF-8 encoded character whose first (perhaps
-only) byte is pointed to by C<s>.
+returns the number of bytes a non-malformed UTF-8 encoded character whose first
+(perhaps only) byte is pointed to by C<s>.
+
+If there is a possibility of malformed input, use instead:
+
+=over
+
+=item L</C<UTF8_SAFE_SKIP>> if you know the maximum ending pointer in the
+buffer pointed to by C<s>; or
+
+=item L</C<UTF8_CHK_SKIP>> if you don't know it.
+
+=back
+
+It is better to restructure your code so the end pointer is passed down so that
+you know what it actually is at the point of this call, but if that isn't
+possible, L</C<UTF8_CHK_SKIP>> can minimize the chance of accessing beyond the end
+of the input buffer.
 
 =cut
 */
 #define UTF8SKIP(s) PL_utf8skip[*(const U8*)(s)]
+
+/*
+=for apidoc Am|STRLEN|UTF8_SKIP|char* s
+This is a synonym for L</C<UTF8SKIP>>
+
+=cut
+*/
+
 #define UTF8_SKIP(s) UTF8SKIP(s)
 
 /*
+=for apidoc Am|STRLEN|UTF8_CHK_SKIP|char* s
+
+This is a safer version of L</C<UTF8SKIP>>, but still not as safe as
+L</C<UTF8_SAFE_SKIP>>.  This version doesn't blindly assume that the input
+string pointed to by C<s> is well-formed, but verifies that there isn't a NUL
+terminating character before the expected end of the next character in C<s>.
+The length C<UTF8_CHK_SKIP> returns stops just before any such NUL.
+
+Perl tends to add NULs, as an insurance policy, after the end of strings in
+SV's, so it is likely that using this macro will prevent inadvertent reading
+beyond the end of the input buffer, even if it is malformed UTF-8.
+
+This macro is intended to be used by XS modules where the inputs could be
+malformed, and it isn't feasible to restructure to use the safer
+L</C<UTF8_SAFE_SKIP>>, for example when interfacing with a C library.
+
+=cut
+*/
+
+#define UTF8_CHK_SKIP(s) \
+    (s[0] == '\0' ? 1 : MIN(my_strnlen((char *) (s), UTF8SKIP(s))))
+/*
 =for apidoc Am|STRLEN|UTF8_SAFE_SKIP|char* s|char* e
 
 returns 0 if S<C<s E<gt>= e>>; otherwise returns the number of bytes in the

-- 
Perl5 Master Repository
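
The behaviour described for UTF8_CHK_SKIP() in the log message can be sketched as standalone C, outside of Perl. This is only an illustration under stated assumptions, not the code in utf8.h: sketch_utf8skip(), sketch_strnlen() and sketch_chk_skip() are hypothetical names, and the first-byte lookup is simplified to standard 1-4 byte UTF-8 on an ASCII platform instead of Perl's PL_utf8skip[] table. The macro as quoted in the diff wraps the bounded length in MIN() with a single argument; the sketch returns the bounded length directly, since it can never exceed the expected length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for UTF8SKIP(): expected character length judged
 * from the first byte alone. Assumes ASCII-platform, 1-4 byte UTF-8. */
static size_t sketch_utf8skip(const unsigned char *s)
{
    assert(s != NULL);
    if (*s < 0x80) return 1;   /* ASCII, including NUL */
    if (*s < 0xC0) return 1;   /* stray continuation byte: step over it */
    if (*s < 0xE0) return 2;
    if (*s < 0xF0) return 3;
    return 4;
}

/* Hypothetical stand-in for my_strnlen(): length of s, looking at no more
 * than maxlen bytes and stopping at the first NUL. */
static size_t sketch_strnlen(const char *s, size_t maxlen)
{
    const char *nul = memchr(s, '\0', maxlen);
    return nul ? (size_t)(nul - s) : maxlen;
}

/* Sketch of the UTF8_CHK_SKIP() idea: never report a length that would
 * carry the caller past a NUL inside the expected character. */
static size_t sketch_chk_skip(const unsigned char *s)
{
    if (s[0] == '\0')
        return 1;              /* a NUL is itself a one-byte character */
    return sketch_strnlen((const char *)s, sketch_utf8skip(s));
}
```

For a buffer holding the truncated sequence "\xE2\x82" followed by the terminating NUL, sketch_utf8skip() reports 3 (which would step past the NUL), while sketch_chk_skip() reports 2 and stops just before it.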
