[perl.git] branch blead, updated. v5.25.4-167-g178122f

Karl Williamson Sat, 17 Sep 2016 20:12:07 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/178122f6e0bac976a9e16e3553b9469414d70f3c?hp=d566bd20c27a46aecd668d2f739b9515f46ac74f>


- Log -----------------------------------------------------------------
commit 178122f6e0bac976a9e16e3553b9469414d70f3c
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 17 21:07:29 2016 -0600

    perldelta for new Unicode-handling function.

M       pod/perldelta.pod

commit 82c5d9411d56e20ecc88026f926f533fe72d6b4d
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 19:49:52 2016 -0600

    perlapi: Clarify docs for some is_utf8_foo functions

M       inline.h

commit 25e3a4e08a8b645de44458470ff4f139baf56682
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 18:54:23 2016 -0600

    Add isUTF8_CHAR_flags() macro
    
    This is like the previous 2 commits, but the macro takes a flags
    parameter so any combination of the disallowed flags may be used.  The
    others, along with the original isUTF8_CHAR(), are the most commonly
    desired strictures, and use an implementation of a, hopefully, inlined
    trie for speed.  This is for generality and the major portion of its
    implementation isn't inlined.

M       ext/XS-APItest/APItest.xs
M       ext/XS-APItest/t/utf8.t
M       utf8.h
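
The flags idea described in this commit can be illustrated outside of Perl. What follows is only a minimal sketch under stated assumptions: the `sk_` names and `SK_DISALLOW_*` values are hypothetical (they are not Perl's `UTF8_DISALLOW_*` constants), and the decoder handles only 1 to 4 byte sequences, whereas Perl's extended UTF-8 goes further. It returns the character length, or 0 when the sequence is malformed, truncated, or excluded by the flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch; NOT Perl's UTF8_DISALLOW_*. */
enum {
    SK_DISALLOW_SURROGATE = 1u << 0,
    SK_DISALLOW_NONCHAR   = 1u << 1,
    SK_DISALLOW_SUPER     = 1u << 2
};

/* Returns the byte length of the character starting at s, reading no
 * further than e - 1, or 0 if the bytes are malformed, truncated, or
 * excluded by flags.  Only 1..4 byte sequences are handled here. */
static size_t
sk_utf8_char_flags(const unsigned char *s, const unsigned char *e,
                   unsigned flags)
{
    /* Smallest code point that legitimately needs each length */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    assert(s != NULL && e != NULL);
    if (s >= e) return 0;

    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;          /* stray continuation byte or longer form */

    if ((size_t)(e - s) < len) return 0;            /* truncated */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;        /* not a continuation */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min_cp[len]) return 0;                 /* overlong */

    if ((flags & SK_DISALLOW_SURROGATE) && cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if ((flags & SK_DISALLOW_SUPER) && cp > 0x10FFFF)
        return 0;
    if ((flags & SK_DISALLOW_NONCHAR) && cp <= 0x10FFFF
        && ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE))
        return 0;

    return len;
}
```

With flags of 0 the sketch accepts surrogates, non-characters and code points above U+10FFFF, which mirrors the permissive default described above; OR-ing the flags together narrows what is accepted.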

commit a82be82b512232b63f28c5865113f7990fb59a3a
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Sep 12 16:52:41 2016 -0600

    Add macro for Unicode Corrigendum #9 strict
    
    This macro follows Unicode Corrigendum #9 to allow non-character code
    points.  These are still discouraged but not completely forbidden.
    
    It's best for code that isn't intended to operate on arbitrary other
    code text to use the original definition, but code that does things,
    such as source code control, should change to use this definition if it
    wants to be Unicode-strict.
    
    Perl can't adopt C9 wholesale, as it might create security holes in
    existing applications that rely on Perl keeping non-chars out.

M       ext/XS-APItest/APItest.xs
M       ext/XS-APItest/t/utf8.t
M       regcharclass.h
M       regen/regcharclass.pl
M       utf8.h
M       utfebcdic.h
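
The difference between the two strictness levels can be stated at the code point level. This is a sketch only, not the macros added by these commits: the `sk_` function names are hypothetical. The one difference it shows is that the C9 variant accepts non-character code points while still rejecting surrogates and anything above U+10FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* UTF-16 surrogates, U+D800..U+DFFF */
static bool sk_is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Non-characters: U+FDD0..U+FDEF, plus the last two code points of
 * every plane (U+xxFFFE and U+xxFFFF) */
static bool sk_is_nonchar(uint32_t cp)
{
    return cp <= 0x10FFFF
        && ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE);
}

/* Code points beyond the Unicode range */
static bool sk_is_above_unicode(uint32_t cp)
{
    return cp > 0x10FFFF;
}

/* Corrigendum #9 strictness: non-characters are allowed */
static bool sk_c9_strict_ok(uint32_t cp)
{
    return !sk_is_surrogate(cp) && !sk_is_above_unicode(cp);
}

/* Original strictness: non-characters are excluded as well */
static bool sk_strict_ok(uint32_t cp)
{
    return sk_c9_strict_ok(cp) && !sk_is_nonchar(cp);
}

/* Anything the original definition accepts is also accepted under C9 */
static void sk_check_relation(uint32_t cp)
{
    assert(!sk_strict_ok(cp) || sk_c9_strict_ok(cp));
}
```

For example, U+FFFF passes `sk_c9_strict_ok()` but fails `sk_strict_ok()`, while U+D800 and 0x110000 fail both.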

commit e23e8bc1957a5981b8a507b62471ae38ec06c661
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Sep 12 13:38:22 2016 -0600

    Add macro for determining if UTF-8 is Unicode-strict

M       ext/XS-APItest/APItest.xs
M       ext/XS-APItest/t/utf8.t
M       regcharclass.h
M       regen/regcharclass.pl
M       utf8.h
M       utfebcdic.h

commit 2c6ed66c0652679e56178882f052322c3fe69a8f
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Sep 12 14:30:15 2016 -0600

    perlapi: Clarify isUTF8_CHAR()

M       utf8.h

commit c41b2540e92ea1e5c8c0019b8a4c085c5fd741e8
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 17:09:51 2016 -0600

    inline.h: Add 'const's; avoid hiding outer variable
    
    This changes some formal parameters to be const, and avoids reusing the
    same variable name within an inner block, to avoid confusion

M       embed.fnc
M       inline.h
M       mathoms.c
M       proto.h

commit 3d56ecbe82b99d21cf2f5e67297d4236e38b282d
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 8 11:34:15 2016 -0600

    Add tests for is_utf8_valid_partial_char_flags()

M       ext/XS-APItest/APItest.xs
M       ext/XS-APItest/t/utf8.t

commit f1c999a79ad93bb81cbb7b1bec96e06c33773b81
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Sep 11 22:18:57 2016 -0600

    Add is_utf8_valid_partial_char_flags()
    
    This is a generalization of is_utf8_valid_partial_char to allow the
    caller to automatically exclude things such as surrogates.

M       embed.fnc
M       embed.h
M       inline.h
M       proto.h
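
A partial-character check of this kind can be sketched in standalone C. The assumptions are considerable: the `sk_` names are hypothetical, a single boolean stands in for the flags parameter, and the sketch accepts only standard UTF-8 up to U+10FFFF rather than Perl's extended UTF-8. It returns true only when the bytes form a proper prefix of some acceptable character, and false for a complete character, as the tests in this push expect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Expected sequence length for a lead byte; 0 if it cannot start a
 * character (sketch: standard UTF-8, 1..4 bytes). */
static size_t sk_expected_len(unsigned char b)
{
    if (b < 0x80)               return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if ((b & 0xF0) == 0xE0)     return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* True if s .. e-1 could be completed into a valid character by
 * appending at least one more byte. */
static bool
sk_valid_partial_char(const unsigned char *s, const unsigned char *e,
                      bool disallow_surrogates)
{
    size_t have, need, i;

    assert(s != NULL && e != NULL);
    if (s >= e) return false;

    have = (size_t)(e - s);
    need = sk_expected_len(s[0]);

    /* Bad lead byte, or already a full character (or more) */
    if (need == 0 || have >= need) return false;

    for (i = 1; i < have; i++)
        if ((s[i] & 0xC0) != 0x80) return false;

    if (have >= 2) {
        /* The second byte already decides these cases */
        if (s[0] == 0xE0 && s[1] < 0xA0) return false; /* overlong */
        if (s[0] == 0xF0 && s[1] < 0x90) return false; /* overlong */
        if (s[0] == 0xF4 && s[1] > 0x8F) return false; /* > U+10FFFF */
        if (disallow_surrogates && s[0] == 0xED && s[1] >= 0xA0)
            return false;                              /* surrogate */
    }
    return true;
}
```

A lone 0xED is still a valid partial character even when surrogates are disallowed, because a second byte is needed to tell a surrogate apart from its neighbours. Non-characters cannot be ruled out until the whole character is available, so the sketch makes no attempt to do so.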

commit 6cbb924831d50981620d4c51f8b12da5f269e569
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Sep 11 09:40:37 2016 -0600

    perlapi: Reword description of is_utf8_valid_partial_char

M       inline.h

commit 8875bd4810684429d98f935fba9e6e016f1b9ca7
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:27:37 2016 -0600

    Fix off-by-one error in is_utf8_valid_partial_char()

M       inline.h

commit 085b753446f9c21f09e86beb916fb5b857faf36d
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:24:48 2016 -0600

    handy.h: Comment memEQs and memNEs

M       handy.h
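
For readers unfamiliar with these macros, here is a small standalone sketch of the same shape. The `SK_` and `sk_` names are hypothetical and `memcmp()` stands in for Perl's `memEQ()`. The `"" s2 ""` concatenation compiles only when `s2` is a string literal, which is what lets `sizeof(s2) - 1` serve as its length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compare the l bytes at s1 against the string literal s2: equal only
 * if the lengths match and the bytes are identical. */
#define SK_memEQs(s1, l, s2) \
    (sizeof(s2) - 1 == (l) \
     && memcmp((s1), ("" s2 ""), sizeof(s2) - 1) == 0)
#define SK_memNEs(s1, l, s2) (!SK_memEQs(s1, l, s2))

/* Hypothetical helper: is the len-byte buffer exactly "utf8"? */
static bool sk_is_utf8_name(const char *buf, size_t len)
{
    return SK_memEQs(buf, len, "utf8");
}

/* A few sanity checks on the macros themselves */
static void sk_self_check(void)
{
    assert(SK_memEQs("abc", 3, "abc"));
    assert(SK_memNEs("abc", 2, "abc"));   /* length mismatch */
}
```

Because the length is compared first, a buffer that merely starts with the constant does not match unless the caller passes exactly that many bytes.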

commit f2bf18ccb4cce9b47fd2180d0348f08e8cbbd663
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:18:59 2016 -0600

    utf8.c: Add some UNLIKELYs

M       utf8.c

commit df863e438c2ab516c61ce7abbf3f8afdbdf56e7e
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:18:16 2016 -0600

    utf8.h: Add comment, white-space changes

M       utf8.h

commit 2b47960981adadbe81b9635d4ca7861c45ccdced
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:09:44 2016 -0600

    Enhance and rename is_utf8_char_slow()
    
    This changes the name of this helper function and adds a parameter and
    functionality to allow it to exclude problematic classes of code
    points, the same ones excludeable by utf8n_to_uvchar(), like surrogates
    or non-character code points.

M       embed.fnc
M       embed.h
M       inline.h
M       proto.h
M       utf8.c
M       utf8.h
-----------------------------------------------------------------------

Summary of changes:
 embed.fnc                 |  13 +-
 embed.h                   |   4 +-
 ext/XS-APItest/APItest.xs |  30 +++++
 ext/XS-APItest/t/utf8.t   | 298 +++++++++++++++++++++++++++++++++++++++++++++-
 handy.h                   |   1 +
 inline.h                  | 112 ++++++++++-------
 mathoms.c                 |   2 +-
 pod/perldelta.pod         |   4 +-
 proto.h                   |  19 +--
 regcharclass.h            |   2 +-
 regen/regcharclass.pl     |  53 +++++++++
 utf8.c                    | 142 ++++++++++++++++++----
 utf8.h                    | 197 +++++++++++++++++++++++++++---
 utfebcdic.h               | 190 +++++++++++++++++++++++++++--
 14 files changed, 956 insertions(+), 111 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index 43ed918..983c3ac 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -682,7 +682,7 @@ ApR |I32    |is_lvalue_sub
 : Used in cop.h
 XopR   |I32    |was_lvalue_sub
 #ifndef PERL_NO_INLINE_FUNCTIONS
-ApMRnP |STRLEN |_is_utf8_char_slow|NN const U8 * const s|const STRLEN len
+ApMRnP |STRLEN |_is_utf8_char_helper|NN const U8 * const s|NN const U8 * const e|const U32 flags
 #endif
 ADMpPR |U32    |to_uni_upper_lc|U32 c
 ADMpPR |U32    |to_uni_title_lc|U32 c
@@ -741,10 +741,13 @@ AmnpdRP   |bool   |is_ascii_string|NN const U8* const s|const STRLEN len
 AmnpdRP        |bool   |is_invariant_string|NN const U8* const s|const STRLEN len
 AnpdD  |STRLEN |is_utf8_char   |NN const U8 *s
 Abmnpd |STRLEN |is_utf8_char_buf|NN const U8 *buf|NN const U8 *buf_end
-AnipdP |bool   |is_utf8_string |NN const U8 *s|STRLEN len
-Anpdmb |bool   |is_utf8_string_loc|NN const U8 *s|STRLEN len|NN const U8 **ep
-Anipd  |bool   |is_utf8_string_loclen|NN const U8 *s|STRLEN len|NULLOK const U8 **ep|NULLOK STRLEN *el
-AnidP  |bool   |is_utf8_valid_partial_char|NN const U8 * const s|NN const U8 * const e
+AnipdP |bool   |is_utf8_string |NN const U8 *s|const STRLEN len
+Anpdmb |bool   |is_utf8_string_loc|NN const U8 *s|const STRLEN len|NN const U8 **ep
+Anipd  |bool   |is_utf8_string_loclen|NN const U8 *s|const STRLEN len|NULLOK const U8 **ep|NULLOK STRLEN *el
+AmndP  |bool   |is_utf8_valid_partial_char                                 \
+               |NN const U8 * const s|NN const U8 * const e
+AnidP  |bool   |is_utf8_valid_partial_char_flags                           \
+               |NN const U8 * const s|NN const U8 * const e|const U32 flags
 AMpR   |bool   |_is_uni_FOO|const U8 classnum|const UV c
 AMpR   |bool   |_is_utf8_FOO|const U8 classnum|NN const U8 *p
 ADMpR  |bool   |is_utf8_alnum  |NN const U8 *p
diff --git a/embed.h b/embed.h
index 6a15a97..50a19a4 100644
--- a/embed.h
+++ b/embed.h
@@ -296,7 +296,7 @@
 #define is_utf8_string         Perl_is_utf8_string
 #define is_utf8_string_loclen  Perl_is_utf8_string_loclen
 #define is_utf8_upper(a)       Perl_is_utf8_upper(aTHX_ a)
-#define is_utf8_valid_partial_char     S_is_utf8_valid_partial_char
+#define is_utf8_valid_partial_char_flags       S_is_utf8_valid_partial_char_flags
 #define is_utf8_xdigit(a)      Perl_is_utf8_xdigit(aTHX_ a)
 #define is_utf8_xidcont(a)     Perl_is_utf8_xidcont(aTHX_ a)
 #define is_utf8_xidfirst(a)    Perl_is_utf8_xidfirst(aTHX_ a)
@@ -787,7 +787,7 @@
 #define my_popen(a,b)          Perl_my_popen(aTHX_ a,b)
 #endif
 #if !defined(PERL_NO_INLINE_FUNCTIONS)
-#define _is_utf8_char_slow     Perl__is_utf8_char_slow
+#define _is_utf8_char_helper   Perl__is_utf8_char_helper
 #define append_utf8_from_native_byte   S_append_utf8_from_native_byte
 #define av_top_index(a)                S_av_top_index(aTHX_ a)
 #define cx_popblock(a)         S_cx_popblock(aTHX_ a)
diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs
index d2c1c33..954bb60 100644
--- a/ext/XS-APItest/APItest.xs
+++ b/ext/XS-APItest/APItest.xs
@@ -5327,6 +5327,36 @@ test_isUTF8_CHAR(char *s, STRLEN len)
     OUTPUT:
         RETVAL
 
+STRLEN
+test_isUTF8_CHAR_flags(char *s, STRLEN len, U32 flags)
+    CODE:
+        RETVAL = isUTF8_CHAR_flags((U8 *) s, (U8 *) s + len, flags);
+    OUTPUT:
+        RETVAL
+
+STRLEN
+test_isSTRICT_UTF8_CHAR(char *s, STRLEN len)
+    CODE:
+        RETVAL = isSTRICT_UTF8_CHAR((U8 *) s, (U8 *) s + len);
+    OUTPUT:
+        RETVAL
+
+STRLEN
+test_isC9_STRICT_UTF8_CHAR(char *s, STRLEN len)
+    CODE:
+        RETVAL = isC9_STRICT_UTF8_CHAR((U8 *) s, (U8 *) s + len);
+    OUTPUT:
+        RETVAL
+
+IV
+test_is_utf8_valid_partial_char_flags(char *s, STRLEN len, U32 flags)
+    CODE:
+        /* RETVAL should be bool, but making it IV allows us to test it
+         * returning 0 or 1 */
+        RETVAL = is_utf8_valid_partial_char_flags((U8 *) s, (U8 *) s + len, flags);
+    OUTPUT:
+        RETVAL
+
 UV
 test_toLOWER(UV ord)
     CODE:
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index 735feba..8122534 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -70,6 +70,10 @@ my $UTF8_WARN_SUPER             = 0x0400;
 my $UTF8_DISALLOW_ABOVE_31_BIT  = 0x0800;
 my $UTF8_WARN_ABOVE_31_BIT      = 0x1000;
 my $UTF8_CHECK_ONLY             = 0x2000;
+my $UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE
+                             = $UTF8_DISALLOW_SUPER|$UTF8_DISALLOW_SURROGATE;
+my $UTF8_DISALLOW_ILLEGAL_INTERCHANGE
+              = $UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE|$UTF8_DISALLOW_NONCHAR;
 
 # Test uvchr_to_utf8().
 my $UNICODE_WARN_SURROGATE        = 0x0001;
@@ -338,7 +342,26 @@ for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
         "Verify UTF8_SKIP(chr $hex_n) is $uvchr_skip_should_be");
 
     use bytes;
-    for (my $j = 0; $j < length $n_chr; $j++) {
+    my $byte_length = length $n_chr;
+    for (my $j = 0; $j < $byte_length; $j++) {
+        undef @warnings;
+
+        if ($j == $byte_length - 1) {
+            my $ret = test_is_utf8_valid_partial_char_flags($n_chr, $byte_length, 0);
+            is($ret, 0, "   Verify is_utf8_valid_partial_char_flags(" . display_bytes($n_chr) . ") returns 0 for full character");
+        }
+        else {
+            my $bytes_so_far = substr($n_chr, 0, $j + 1);
+            my $ret = test_is_utf8_valid_partial_char_flags($bytes_so_far, $j + 1, 0);
+            is($ret, 1, "   Verify is_utf8_valid_partial_char_flags(" . display_bytes($bytes_so_far) . ") returns 1");
+        }
+
+        unless (is(scalar @warnings, 0,
+                "   Verify is_utf8_valid_partial_char_flags generated no warnings"))
+        {
+            diag "The warnings were: " . join(", ", @warnings);
+        }
+
         my $b = substr($n_chr, $j, 1);
         my $hex_b = sprintf("\"\\x%02x\"", ord $b);
 
@@ -402,11 +425,17 @@ for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
         $this_utf8_flags &=
                         ~($UTF8_DISALLOW_ABOVE_31_BIT|$UTF8_WARN_ABOVE_31_BIT);
     }
+
+    my $valid_under_strict = 1;
+    my $valid_under_c9strict = 1;
     if ($n > 0x10FFFF) {
         $this_utf8_flags &= ~($UTF8_DISALLOW_SUPER|$UTF8_WARN_SUPER);
+        $valid_under_strict = 0;
+        $valid_under_c9strict = 0;
     }
     elsif (($n & 0xFFFE) == 0xFFFE) {
         $this_utf8_flags &= ~($UTF8_DISALLOW_NONCHAR|$UTF8_WARN_NONCHAR);
+        $valid_under_strict = 0;
     }
 
     undef @warnings;
@@ -447,6 +476,96 @@ for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
 
     undef @warnings;
 
+    $ret = test_isUTF8_CHAR_flags($bytes, $len, 0);
+    is($ret, $len, "Verify isUTF8_CHAR_flags($display_bytes, 0) returns expected length: $len");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isUTF8_CHAR_flags() for $hex_n generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
+    $ret = test_isUTF8_CHAR_flags($bytes, $len - 1, 0);
+    is($ret, 0, "Verify isUTF8_CHAR_flags() with too short length parameter returns 0");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isUTF8_CHAR_flags() generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
+    $ret = test_isSTRICT_UTF8_CHAR($bytes, $len);
+    my $expected_len = ($valid_under_strict) ? $len : 0;
+    is($ret, $expected_len, "Verify isSTRICT_UTF8_CHAR($display_bytes) returns expected length: $expected_len");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isSTRICT_UTF8_CHAR() for $hex_n generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
+    $ret = test_isSTRICT_UTF8_CHAR($bytes, $len - 1);
+    is($ret, 0, "Verify isSTRICT_UTF8_CHAR() with too short length parameter returns 0");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isSTRICT_UTF8_CHAR() generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
+    $ret = test_isUTF8_CHAR_flags($bytes, $len, $UTF8_DISALLOW_ILLEGAL_INTERCHANGE);
+    is($ret, $expected_len, "Verify isUTF8_CHAR_flags('DISALLOW_ILLEGAL_INTERCHANGE') acts like isSTRICT_UTF8_CHAR");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isUTF8_CHAR() for $hex_n generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
+    $ret = test_isC9_STRICT_UTF8_CHAR($bytes, $len);
+    $expected_len = ($valid_under_c9strict) ? $len : 0;
+    is($ret, $expected_len, "Verify isC9_STRICT_UTF8_CHAR($display_bytes) returns expected length: $len");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isC9_STRICT_UTF8_CHAR() for $hex_n generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
+    $ret = test_isC9_STRICT_UTF8_CHAR($bytes, $len - 1);
+    is($ret, 0, "Verify isC9_STRICT_UTF8_CHAR() with too short length parameter returns 0");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isC9_STRICT_UTF8_CHAR() generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
+    $ret = test_isUTF8_CHAR_flags($bytes, $len, $UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE);
+    is($ret, $expected_len, "Verify isUTF8_CHAR_flags('DISALLOW_ILLEGAL_C9_INTERCHANGE') acts like isC9_STRICT_UTF8_CHAR");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isUTF8_CHAR() for $hex_n generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
     $ret_ref = test_valid_utf8_to_uvchr($bytes);
     is($ret_ref->[0], $n, "Verify valid_utf8_to_uvchr($display_bytes) returns $hex_n");
     is($ret_ref->[1], $len, "Verify valid_utf8_to_uvchr() for $hex_n returns expected length: $len");
@@ -715,6 +834,78 @@ foreach my $test (@malformations) {
         diag "The warnings were: " . join(", ", @warnings);
     }
 
+    undef @warnings;
+
+    $ret = test_isUTF8_CHAR_flags($bytes, $length, 0);
+    is($ret, 0, "$testname: isUTF8_CHAR_flags returns 0");
+    unless (is(scalar @warnings, 0,
+               "$testname: isUTF8_CHAR() generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    $ret = test_isSTRICT_UTF8_CHAR($bytes, $length);
+    is($ret, 0, "$testname: isSTRICT_UTF8_CHAR returns 0");
+    unless (is(scalar @warnings, 0,
+               "$testname: isSTRICT_UTF8_CHAR() generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    $ret = test_isC9_STRICT_UTF8_CHAR($bytes, $length);
+    is($ret, 0, "$testname: isC9_STRICT_UTF8_CHAR returns 0");
+    unless (is(scalar @warnings, 0,
+               "$testname: isC9_STRICT_UTF8_CHAR() generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    for my $j (1 .. $length - 1) {
+        my $partial = substr($bytes, 0, $j);
+
+        undef @warnings;
+
+        $ret = test_is_utf8_valid_partial_char_flags($bytes, $j, 0);
+        my $ret_should_be = 0;
+        my $comment = "";
+        if ($testname =~ /premature|short/ && $j < 2) {
+            $ret_should_be = 1;
+            $comment = ", but need 2 bytes to discern:";
+        }
+        elsif ($testname =~ /overlong/ && $length > 2) {
+            if ($length <= 7 && $j < 2) {
+                $ret_should_be = 1;
+                $comment = ", but need 2 bytes to discern:";
+            }
+            elsif ($length > 7 && $j < 7) {
+                $ret_should_be = 1;
+                $comment = ", but need 7 bytes to discern:";
+            }
+        }
+        elsif ($testname =~ /overflow/ && $testname !~ /first byte/) {
+            if (isASCII) {
+                if ($j < (($is64bit) ? 3 : 2)) {
+                    $comment = ", but need $j bytes to discern:";
+                    $ret_should_be = 1;
+                }
+            }
+            else {
+                if ($j < (($is64bit) ? 2 : 8)) {
+                    $comment = ", but need $j bytes to discern:";
+                    $ret_should_be = 1;
+                }
+            }
+        }
+        is($ret, $ret_should_be, "$testname: is_utf8_valid_partial_char_flags("
+                                . display_bytes($partial)
+                                . ")$comment returns $ret_should_be");
+        unless (is(scalar @warnings, 0,
+                "$testname: is_utf8_valid_partial_char_flags() generated no warnings"))
+        {
+            diag "The warnings were: " . join(", ", @warnings);
+        }
+    }
+
 
     # Test what happens when this malformation is not allowed
     undef @warnings;
@@ -1162,18 +1353,121 @@ foreach my $test (@tests) {
         use warnings;
         undef @warnings;
         my $ret = test_isUTF8_CHAR($bytes, $length);
+        my $ret_flags = test_isUTF8_CHAR_flags($bytes, $length, 0);
         if ($will_overflow) {
             is($ret, 0, "isUTF8_CHAR() $testname: returns 0");
+            is($ret_flags, 0, "isUTF8_CHAR_flags() $testname: returns 0");
         }
         else {
             is($ret, $length,
                "isUTF8_CHAR() $testname: returns expected length: $length");
+            is($ret_flags, $length,
+               "isUTF8_CHAR_flags(...,0) $testname: returns expected length: $length");
         }
         unless (is(scalar @warnings, 0,
-                "isUTF8_CHAR() $testname: generated no warnings"))
+                "isUTF8_CHAR() and isUTF8_CHAR()_flags $testname: generated no warnings"))
         {
             diag "The warnings were: " . join(", ", @warnings);
         }
+
+        undef @warnings;
+        $ret = test_isSTRICT_UTF8_CHAR($bytes, $length);
+        if ($will_overflow) {
+            is($ret, 0, "isSTRICT_UTF8_CHAR() $testname: returns 0");
+        }
+        else {
+            my $expected_ret = (   $testname =~ /surrogate|non-character/
+                                || $allowed_uv > 0x10FFFF)
+                               ? 0
+                               : $length;
+            is($ret, $expected_ret,
+               "isSTRICT_UTF8_CHAR() $testname: returns expected length: $expected_ret");
+            $ret = test_isUTF8_CHAR_flags($bytes, $length,
+                                          $UTF8_DISALLOW_ILLEGAL_INTERCHANGE);
+            is($ret, $expected_ret,
+               "isUTF8_CHAR_flags('DISALLOW_ILLEGAL_INTERCHANGE') acts like isSTRICT_UTF8_CHAR");
+        }
+        unless (is(scalar @warnings, 0,
+                "isSTRICT_UTF8_CHAR() and isUTF8_CHAR_flags $testname: generated no warnings"))
+        {
+            diag "The warnings were: " . join(", ", @warnings);
+        }
+
+        undef @warnings;
+        $ret = test_isC9_STRICT_UTF8_CHAR($bytes, $length);
+        if ($will_overflow) {
+            is($ret, 0, "isC9_STRICT_UTF8_CHAR() $testname: returns 0");
+        }
+        else {
+            my $expected_ret = (   $testname =~ /surrogate/
+                                || $allowed_uv > 0x10FFFF)
+                               ? 0
+                               : $length;
+            is($ret, $expected_ret,
+               "isC9_STRICT_UTF8_CHAR() $testname: returns expected length: $expected_ret");
+            $ret = test_isUTF8_CHAR_flags($bytes, $length,
+                                          $UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE);
+            is($ret, $expected_ret,
+               "isUTF8_CHAR_flags('DISALLOW_ILLEGAL_C9_INTERCHANGE') acts like isC9_STRICT_UTF8_CHAR");
+        }
+        unless (is(scalar @warnings, 0,
+                "isC9_STRICT_UTF8_CHAR() and isUTF8_CHAR_flags $testname: generated no warnings"))
+        {
+            diag "The warnings were: " . join(", ", @warnings);
+        }
+
+        # Test partial character handling, for each byte not a full character
+        for my $j (1.. $length - 1) {
+
+            # Skip the test for the interaction between overflow and above-31
+            # bit.  It is really testing other things than the partial
+            # character tests, for which other tests in this file are
+            # sufficient
+            last if $testname =~ /overflow/;
+
+            foreach my $disallow_flag (0, $disallow_flags) {
+                my $partial = substr($bytes, 0, $j);
+                my $ret_should_be;
+                my $comment;
+                if ($disallow_flag) {
+                    $ret_should_be = 0;
+                    $comment = "disallowed";
+                }
+                else {
+                    $ret_should_be = 1;
+                    $comment = "allowed";
+                }
+
+                if ($disallow_flag) {
+                    if ($testname =~ /non-character/) {
+                        $ret_should_be = 1;
+                        $comment .= ", but but need full char to discern";
+                    }
+                    elsif ($testname =~ /surrogate/) {
+                        if ($j < 2) {
+                            $ret_should_be = 1;
+                            $comment .= ", but need 2 bytes to discern";
+                        }
+                    }
+                    elsif ($testname =~ /first non_unicode/ && $j < 2) {
+                        $ret_should_be = 1;
+                        $comment .= ", but need 2 bytes to discern";
+                    }
+                }
+
+                undef @warnings;
+
+                $ret = test_is_utf8_valid_partial_char_flags($partial, $j, $disallow_flag);
+                is($ret, $ret_should_be, "$testname: is_utf8_valid_partial_char_flags("
+                                        . display_bytes($partial)
+                                        . "), $comment: returns $ret_should_be");
+                unless (is(scalar @warnings, 0,
+                        "$testname: is_utf8_valid_partial_char_flags() generated no warnings"))
+                {
+                    diag "The warnings were: " . join(", ", @warnings);
+                }
+            }
+        }
     }
 
     # This is more complicated than the malformations tested earlier, as there
diff --git a/handy.h b/handy.h
index d131cfc..5428d7c 100644
--- a/handy.h
+++ b/handy.h
@@ -489,6 +489,7 @@ Returns zero if non-equal, or non-zero if equal.
 #  define memEQ(s1,s2,l) (!bcmp(s1,s2,l))
 #endif
 
+/* memEQ and memNE where second comparand is a string constant */
 #define memEQs(s1, l, s2) \
        (sizeof(s2)-1 == l && memEQ(s1, ("" s2 ""), (sizeof(s2)-1)))
 #define memNEs(s1, l, s2) !memEQs(s1, l, s2)
diff --git a/inline.h b/inline.h
index 74343a1..e4b857d 100644
--- a/inline.h
+++ b/inline.h
@@ -290,7 +290,7 @@ non-Unicode code points are allowed.
 PERL_STATIC_INLINE UV
 Perl_valid_utf8_to_uvchr(const U8 *s, STRLEN *retlen)
 {
-    UV expectlen = UTF8SKIP(s);
+    const UV expectlen = UTF8SKIP(s);
     const U8* send = s + expectlen;
     UV uv = *s;
 
@@ -323,13 +323,12 @@ Perl_valid_utf8_to_uvchr(const U8 *s, STRLEN *retlen)
 /*
 =for apidoc is_utf8_invariant_string
 
-Returns true iff the first C<len> bytes of the string C<s> are the same
+Returns TRUE if the first C<len> bytes of the string C<s> are the same
 regardless of the UTF-8 encoding of the string (or UTF-EBCDIC encoding on
-EBCDIC machines).  That is, if they are UTF-8 invariant.  On ASCII-ish
-machines, all the ASCII characters and only the ASCII characters fit this
-definition.  On EBCDIC machines, the ASCII-range characters are invariant, but
-so also are the C1 controls and C<\c?> (which isn't in the ASCII range on
-EBCDIC).
+EBCDIC machines); otherwise it returns FALSE.  That is, it returns TRUE if they
+are UTF-8 invariant.  On ASCII-ish machines, all the ASCII characters and only
+the ASCII characters fit this definition.  On EBCDIC machines, the ASCII-range
+characters are invariant, but so also are the C1 controls.
 
 If C<len> is 0, it will be calculated using C<strlen(s)>, (which means if you
 use this option, that C<s> can't have embedded C<NUL> characters and has to
@@ -360,11 +359,14 @@ S_is_utf8_invariant_string(const U8* const s, const STRLEN len)
 /*
 =for apidoc is_utf8_string
 
-Returns true if the first C<len> bytes of string C<s> form a valid
-UTF-8 string, false otherwise.  If C<len> is 0, it will be calculated
-using C<strlen(s)> (which means if you use this option, that C<s> can't have
-embedded C<NUL> characters and has to have a terminating C<NUL> byte).  Note
-that all characters being ASCII constitute 'a valid UTF-8 string'.
+Returns TRUE if the first C<len> bytes of string C<s> form a valid
+Perl-extended-UTF-8 string; returns FALSE otherwise.  If C<len> is 0, it will
+be calculated using C<strlen(s)> (which means if you use this option, that C<s>
+can't have embedded C<NUL> characters and has to have a terminating C<NUL>
+byte).  Note that all characters being ASCII constitute 'a valid UTF-8 string'.
+
+Code points above Unicode, surrogates, and non-character code points are
+considered valid by this function.
 
 See also L</is_utf8_invariant_string>(), L</is_utf8_string_loclen>(), and
 L</is_utf8_string_loc>().
@@ -373,7 +375,7 @@ L</is_utf8_string_loc>().
 */
 
 PERL_STATIC_INLINE bool
-Perl_is_utf8_string(const U8 *s, STRLEN len)
+Perl_is_utf8_string(const U8 *s, const STRLEN len)
 {
     /* This is now marked pure in embed.fnc, because isUTF8_CHAR now is pure.
      * Be aware of possible changes to that */
@@ -384,11 +386,11 @@ Perl_is_utf8_string(const U8 *s, STRLEN len)
     PERL_ARGS_ASSERT_IS_UTF8_STRING;
 
     while (x < send) {
-        STRLEN len = isUTF8_CHAR(x, send);
-        if (UNLIKELY(! len)) {
+        const STRLEN cur_len = isUTF8_CHAR(x, send);
+        if (UNLIKELY(! cur_len)) {
             return FALSE;
         }
-        x += len;
+        x += cur_len;
     }
 
     return TRUE;
@@ -401,7 +403,7 @@ Implemented as a macro in utf8.h
 
 Like L</is_utf8_string> but stores the location of the failure (in the
 case of "utf8ness failure") or the location C<s>+C<len> (in the case of
-"utf8ness success") in the C<ep>.
+"utf8ness success") in the C<ep> pointer.
 
 See also L</is_utf8_string_loclen>() and L</is_utf8_string>().
 
@@ -410,7 +412,7 @@ See also L</is_utf8_string_loclen>() and L</is_utf8_string>().
 Like L</is_utf8_string>() but stores the location of the failure (in the
 case of "utf8ness failure") or the location C<s>+C<len> (in the case of
 "utf8ness success") in the C<ep>, and the number of UTF-8
-encoded characters in the C<el>.
+encoded characters in the C<el> pointer.
 
 See also L</is_utf8_string_loc>() and L</is_utf8_string>().
 
@@ -418,7 +420,7 @@ See also L</is_utf8_string_loc>() and L</is_utf8_string>().
 */
 
 PERL_STATIC_INLINE bool
-Perl_is_utf8_string_loclen(const U8 *s, STRLEN len, const U8 **ep, STRLEN *el)
+Perl_is_utf8_string_loclen(const U8 *s, const STRLEN len, const U8 **ep, STRLEN *el)
 {
     const U8* const send = s + (len ? len : strlen((const char *)s));
     const U8* x = s;
@@ -427,11 +429,11 @@ Perl_is_utf8_string_loclen(const U8 *s, STRLEN len, const U8 **ep, STRLEN *el)
     PERL_ARGS_ASSERT_IS_UTF8_STRING_LOCLEN;
 
     while (x < send) {
-        STRLEN len = isUTF8_CHAR(x, send);
-        if (UNLIKELY(! len)) {
+        const STRLEN cur_len = isUTF8_CHAR(x, send);
+        if (UNLIKELY(! cur_len)) {
             break;
         }
-        x += len;
+        x += cur_len;
         outlen++;
     }
 
@@ -505,39 +507,63 @@ Perl_utf8_hop(const U8 *s, SSize_t off)
 
 =for apidoc is_utf8_valid_partial_char
 
-Returns 1 if there exists some sequence of bytes, call it C<s'>, that when
-appended to the sequence from C<s> through S<C<e - 1>> causes the entire
-sequence starting at C<s> (including C<s'>) to be the well-formed UTF-8 of
-some code point; otherwise returns 0.
+Returns 0 if the sequence of bytes starting at C<s> and looking no further than
+S<C<e - 1>> is the UTF-8 encoding, as extended by Perl, for one or more code
+points.  Otherwise, it returns 1 if there exists at least one non-empty
+sequence of bytes that when appended to sequence C<s>, starting at position
+C<e> causes the entire sequence to be the well-formed UTF-8 of some code point;
+otherwise returns 0.
+
+In other words this returns TRUE if C<s> points to a partial UTF-8-encoded code
+point.
+
+This is useful when a fixed-length buffer is being tested for being well-formed
+UTF-8, but the final few bytes in it don't comprise a full character; that is,
+it is split somewhere in the middle of the final code point's UTF-8
+representation.  (Presumably when the buffer is refreshed with the next chunk
+of data, the new first bytes will complete the partial code point.)   This
+function is used to verify that the final bytes in the current buffer are in
+fact the legal beginning of some code point, so that if they aren't, the
+failure can be signalled without having to wait for the next read.
 
-In other words this returns TRUE if C<s> points to the beginning, but partial,
-sequence of the UTF-8 for some code point.
+=cut
+*/
+#define is_utf8_valid_partial_char(s, e) is_utf8_valid_partial_char_flags(s, e, 0)
+
+/*
 
-This is useful when some fixed-length buffer is being tested for being
-well-formed UTF-8, but the final few bytes in it don't comprise a full
-character: it is split somewhere in the middle of its UTF-8 representation.
-(Presumably when the buffer is refreshed with the next chunk of data, the new
-first bytes will complete the partial code point.)   This function is used to
-verify that the final bytes in the current buffer are in fact the legal
-beginning of some code point, so that if they aren't, the failure can be
-signalled without having to wait for the next read.
+=for apidoc is_utf8_valid_partial_char_flags
 
-If the bytes terminated at S<C<e - 1>> are a full character (or more), 0 is
-returned.
+Like C<L</is_utf8_valid_partial_char>>, it returns a boolean giving whether
+or not the input is a valid UTF-8 encoded partial character, but it takes an
+extra parameter, C<flags>, which can further restrict which code points are
+considered valid.
+
+If C<flags> is 0, this behaves identically to
+C<L</is_utf8_valid_partial_char>>.  Otherwise C<flags> can be any combination
+of the C<UTF8_DISALLOW_I<foo>> flags accepted by C<L</utf8n_to_uvchr>>.  If
+there is any sequence of bytes that can complete the input partial character in
+such a way that a non-prohibited character is formed, the function returns
+TRUE; otherwise FALSE.  Non-character code points cannot be determined from
+partial character input, but many of the other possible excluded types can be
+determined from just the first one or two bytes.
 
 =cut
-*/
+ */
+
 PERL_STATIC_INLINE bool
-S_is_utf8_valid_partial_char(const U8 * const s, const U8 * const e)
+S_is_utf8_valid_partial_char_flags(const U8 * const s, const U8 * const e, const U32 flags)
 {
+    PERL_ARGS_ASSERT_IS_UTF8_VALID_PARTIAL_CHAR_FLAGS;
 
-    PERL_ARGS_ASSERT_IS_UTF8_VALID_PARTIAL_CHAR;
+    assert(0 == (flags & ~(UTF8_DISALLOW_ILLEGAL_INTERCHANGE
+                          |UTF8_DISALLOW_ABOVE_31_BIT)));
 
-    if (s >= e || s + UTF8SKIP(s) < e) {
+    if (s >= e || s + UTF8SKIP(s) <= e) {
         return FALSE;
     }
 
-    return cBOOL(_is_utf8_char_slow(s, e - s));
+    return cBOOL(_is_utf8_char_helper(s, e, flags));
 }
 
 /* ------------------------------- perl.h ----------------------------- */
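The partial-character test documented above can be illustrated outside of the Perl core. The following is a minimal sketch, assuming plain standard UTF-8 (so surrogates and anything above U+10FFFF are rejected, roughly what the real function does when the matching UTF8_DISALLOW flags are set, and unlike its behavior with flags of 0). The names demo_expected_len and demo_is_partial_char are hypothetical and are not part of perl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of bytes implied by a start byte in standard UTF-8; 0 if 'b'
 * cannot start a character (continuation byte, C0/C1, or F5 and above). */
static size_t demo_expected_len(unsigned char b)
{
    if (b < 0x80)               return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* True if the 'n' bytes at 'bytes' are a non-empty, proper prefix of some
 * well-formed standard UTF-8 character.  A complete character (or more)
 * yields false, mirroring the documented return of 0 in that case. */
static bool demo_is_partial_char(const char *bytes, size_t n)
{
    const unsigned char *s = (const unsigned char *) bytes;
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for byte 2 */
    size_t need, i;

    assert(bytes != NULL);

    if (n == 0) return false;
    need = demo_expected_len(s[0]);
    if (need == 0 || n >= need) return false;
    if (n == 1) return true;              /* a lone valid start byte */

    /* Some start bytes narrow the legal range of the second byte */
    switch (s[0]) {
        case 0xE0: lo = 0xA0; break;      /* below this is overlong */
        case 0xED: hi = 0x9F; break;      /* above this is a surrogate */
        case 0xF0: lo = 0x90; break;      /* below this is overlong */
        case 0xF4: hi = 0x8F; break;      /* above this exceeds U+10FFFF */
        default:              break;
    }
    if (s[1] < lo || s[1] > hi) return false;

    /* Anything after the second byte must be a continuation byte */
    for (i = 2; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return false;
    }
    return true;
}
```

A caller reading fixed-size chunks would run a check like this on the trailing bytes of each chunk, and report a malformation immediately if it fails rather than waiting for the next read.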
diff --git a/mathoms.c b/mathoms.c
index 1480186..a3f20e7 100644
--- a/mathoms.c
+++ b/mathoms.c
@@ -690,7 +690,7 @@ Perl_init_i18nl14n(pTHX_ int printwarn)
 }
 
 bool
-Perl_is_utf8_string_loc(const U8 *s, STRLEN len, const U8 **ep)
+Perl_is_utf8_string_loc(const U8 *s, const STRLEN len, const U8 **ep)
 {
     PERL_ARGS_ASSERT_IS_UTF8_STRING_LOC;
 
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index cf414cd..732df74 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -342,7 +342,9 @@ well.
 
 =item *
 
-XXX
+Several macros and functions have been added to the public API for
+dealing with Unicode and UTF-8-encoded strings.  See
+L<perlapi/Unicode Support>.
 
 =back
 
diff --git a/proto.h b/proto.h
index ec99684..f57fb35 100644
--- a/proto.h
+++ b/proto.h
@@ -1609,17 +1609,17 @@ PERL_CALLCONV bool      Perl_is_utf8_space(pTHX_ const U8 *p)
 #define PERL_ARGS_ASSERT_IS_UTF8_SPACE \
        assert(p)
 
-PERL_STATIC_INLINE bool        Perl_is_utf8_string(const U8 *s, STRLEN len)
+PERL_STATIC_INLINE bool        Perl_is_utf8_string(const U8 *s, const STRLEN len)
                        __attribute__pure__;
 #define PERL_ARGS_ASSERT_IS_UTF8_STRING        \
        assert(s)
 
 #ifndef NO_MATHOMS
-PERL_CALLCONV bool     Perl_is_utf8_string_loc(const U8 *s, STRLEN len, const U8 **ep);
+PERL_CALLCONV bool     Perl_is_utf8_string_loc(const U8 *s, const STRLEN len, const U8 **ep);
 #define PERL_ARGS_ASSERT_IS_UTF8_STRING_LOC    \
        assert(s); assert(ep)
 #endif
-PERL_STATIC_INLINE bool        Perl_is_utf8_string_loclen(const U8 *s, STRLEN len, const U8 **ep, STRLEN *el);
+PERL_STATIC_INLINE bool        Perl_is_utf8_string_loclen(const U8 *s, const STRLEN len, const U8 **ep, STRLEN *el);
 #define PERL_ARGS_ASSERT_IS_UTF8_STRING_LOCLEN \
        assert(s)
 PERL_CALLCONV bool     Perl_is_utf8_upper(pTHX_ const U8 *p)
@@ -1628,9 +1628,12 @@ PERL_CALLCONV bool       Perl_is_utf8_upper(pTHX_ const U8 *p)
 #define PERL_ARGS_ASSERT_IS_UTF8_UPPER \
        assert(p)
 
-PERL_STATIC_INLINE bool        S_is_utf8_valid_partial_char(const U8 * const s, const U8 * const e)
+/* PERL_CALLCONV bool  is_utf8_valid_partial_char(const U8 * const s, const U8 * const e)
+                       __attribute__pure__; */
+
+PERL_STATIC_INLINE bool        S_is_utf8_valid_partial_char_flags(const U8 * const s, const U8 * const e, const U32 flags)
                        __attribute__pure__;
-#define PERL_ARGS_ASSERT_IS_UTF8_VALID_PARTIAL_CHAR    \
+#define PERL_ARGS_ASSERT_IS_UTF8_VALID_PARTIAL_CHAR_FLAGS      \
        assert(s); assert(e)
 
 PERL_CALLCONV bool     Perl_is_utf8_xdigit(pTHX_ const U8 *p)
@@ -3810,11 +3813,11 @@ STATIC SV *     S_incpush_if_exists(pTHX_ AV *const av, SV *dir, SV *const stem);
 #  endif
 #endif
 #if !defined(PERL_NO_INLINE_FUNCTIONS)
-PERL_CALLCONV STRLEN   Perl__is_utf8_char_slow(const U8 * const s, const STRLEN len)
+PERL_CALLCONV STRLEN   Perl__is_utf8_char_helper(const U8 * const s, const U8 * const e, const U32 flags)
                        __attribute__warn_unused_result__
                        __attribute__pure__;
-#define PERL_ARGS_ASSERT__IS_UTF8_CHAR_SLOW    \
-       assert(s)
+#define PERL_ARGS_ASSERT__IS_UTF8_CHAR_HELPER  \
+       assert(s); assert(e)
 
 PERL_STATIC_INLINE void        S_append_utf8_from_native_byte(const U8 byte, U8** dest);
 #define PERL_ARGS_ASSERT_APPEND_UTF8_FROM_NATIVE_BYTE  \
diff --git a/regcharclass.h b/regcharclass.h
index e0a021a..6f5d14b 100644
--- a/regcharclass.h
+++ b/regcharclass.h
@@ -1876,6 +1876,6 @@
  * 5c7eb94310e2aaa15702fd6bed24ff0e7ab5448f9a8231d8c49ca96c9e941089 lib/unicore/mktables
  * cdecb300baad839a6f62791229f551a4fa33f3cbdca08e378dc976466354e778 lib/unicore/version
  * 913d2f93f3cb6cdf1664db888bf840bc4eb074eef824e082fceda24a9445e60c regen/charset_translations.pl
- * 1876ece914e2c14ed38c8a589adaa3d8193532c3a5bbe9ea5c3279bc9d29b279 regen/regcharclass.pl
+ * 66e20f857451956f9fc7ad7432de972e84fb857885009838878bcf6f91ffbeef regen/regcharclass.pl
  * 393f8d882713a3ba227351ad0f00ea4839fda74fcf77dcd1cdf31519925adba5 regen/regcharclass_multi_char_folds.pl
  * ex: set ro: */
diff --git a/regen/regcharclass.pl b/regen/regcharclass.pl
index bd677ac..f3f8b99 100755
--- a/regen/regcharclass.pl
+++ b/regen/regcharclass.pl
@@ -1660,6 +1660,59 @@ SURROGATE: Surrogate code points
 #=> UTF8 :no_length_checks only_ebcdic_platform
 #0xA0 - 0x1FFFFF
 
+#STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points, no surrogates nor non-character code points
+#=> UTF8 :no_length_checks only_ascii_platform
+#0x0080 - 0xD7FF
+#0xE000 - 0xFDCF
+#0xFDF0 - 0xFFFD
+#0x10000 - 0x1FFFD
+#0x20000 - 0x2FFFD
+#0x30000 - 0x3FFFD
+#0x40000 - 0x4FFFD
+#0x50000 - 0x5FFFD
+#0x60000 - 0x6FFFD
+#0x70000 - 0x7FFFD
+#0x80000 - 0x8FFFD
+#0x90000 - 0x9FFFD
+#0xA0000 - 0xAFFFD
+#0xB0000 - 0xBFFFD
+#0xC0000 - 0xCFFFD
+#0xD0000 - 0xDFFFD
+#0xE0000 - 0xEFFFD
+#0xF0000 - 0xFFFFD
+#0x100000 - 0x10FFFD
+#
+#STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points, no surrogates nor non-character code points
+#=> UTF8 :no_length_checks only_ebcdic_platform
+#0x00A0 - 0xD7FF
+#0xE000 - 0xFDCF
+#0xFDF0 - 0xFFFD
+#0x10000 - 0x1FFFD
+#0x20000 - 0x2FFFD
+#0x30000 - 0x3FFFD
+#0x40000 - 0x4FFFD
+#0x50000 - 0x5FFFD
+#0x60000 - 0x6FFFD
+#0x70000 - 0x7FFFD
+#0x80000 - 0x8FFFD
+#0x90000 - 0x9FFFD
+#0xA0000 - 0xAFFFD
+#0xB0000 - 0xBFFFD
+#0xC0000 - 0xCFFFD
+#0xD0000 - 0xDFFFD
+#0xE0000 - 0xEFFFD
+#0xF0000 - 0xFFFFD
+#0x100000 - 0x10FFFD
+
+#C9_STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points, no surrogates
+#=> UTF8 :no_length_checks only_ascii_platform
+#0x0080 - 0xD7FF
+#0xE000 - 0x10FFFF
+#
+#C9_STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points including non-character code points, no surrogates
+#=> UTF8 :no_length_checks only_ebcdic_platform
+#0x00A0 - 0xD7FF
+#0xE000 - 0x10FFFF
 
 QUOTEMETA: Meta-characters that \Q should quote
 => high :fast
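The STRICT_UTF8_CHAR and C9_STRICT_UTF8_CHAR range lists in regen/regcharclass.pl can be restated as predicates on code points. This is only a sketch of what the ranges describe, not the generated byte-level matcher; all demo_ names are hypothetical. Note that the range lists cover only UTF-8 variant code points (0x80 and up on ASCII platforms), whereas these predicates also accept the invariants, which the public macros handle separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* U+D800..U+DFFF */
static bool demo_is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes */
static bool demo_is_nonchar(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

/* Corrigendum #9 strictness: any Unicode code point except a surrogate */
static bool demo_cp_is_c9_strict(uint32_t cp)
{
    return cp <= 0x10FFFF && ! demo_is_surrogate(cp);
}

/* Full strictness: additionally exclude the non-character code points */
static bool demo_cp_is_strict(uint32_t cp)
{
    return demo_cp_is_c9_strict(cp) && ! demo_is_nonchar(cp);
}
```

The long per-plane list in the strict definition exists only to carve out xFFFE and xFFFF in each plane; the C9 variant needs just two ranges because it keeps them.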
diff --git a/utf8.c b/utf8.c
index 2b9ea5b..ec345ea 100644
--- a/utf8.c
+++ b/utf8.c
@@ -423,37 +423,120 @@ S_is_utf8_cp_above_31_bits(const U8 * const s, const U8 * const e)
 
 }
 
-/*
-
-A helper function for the macro isUTF8_CHAR(), which should be used instead of
-this function.  The macro will handle smaller code points directly saving time,
-using this function as a fall-back for higher code points.  This function
-assumes that it is not called with an invariant character, and that
-'s + len - 1' is within bounds of the string 's'.
-
-Tests if the string C<s> of at least length 'len' is a valid variant UTF-8
-character.  0 is returned if not, otherwise, 'len' is returned.
-
-It is written in such a way that if 'len' is set to less than a full
-character's length, it will test if the bytes ending there form the legal
-beginning of partial character.
-
-*/
-
 STRLEN
-Perl__is_utf8_char_slow(const U8 * const s, const STRLEN len)
+Perl__is_utf8_char_helper(const U8 * const s, const U8 * const e, const U32 flags)
 {
-    const U8 *e;
+    STRLEN len;
     const U8 *x, *y;
 
-    PERL_ARGS_ASSERT__IS_UTF8_CHAR_SLOW;
+    /* A helper function that should not be called directly.
+     *
+     * This function returns non-zero if the string beginning at 's' and
+     * looking no further than 'e - 1' is well-formed Perl-extended-UTF-8 for a
+     * code point; otherwise it returns 0.  The examination stops after the
+     * first code point in 's' is validated, not looking at the rest of the
+     * input.  If 'e' is such that there are not enough bytes to represent a
+     * complete code point, this function will return non-zero anyway, if the
+     * bytes it does have are well-formed UTF-8 as far as they go, and aren't
+     * excluded by 'flags'.
+     *
+     * A non-zero return gives the number of bytes required to represent the
+     * code point.  Be aware that if the input is for a partial character, the
+     * return will be larger than 'e - s'.
+     *
+     * This function assumes that the code point represented is UTF-8 variant.
+     * The caller should have excluded invariants before calling this
+     * function.
+     *
+     * 'flags' can be 0, or any combination of the UTF8_DISALLOW_foo flags
+     * accepted by L</utf8n_to_uvchr>.  If non-zero, this function will return
+     * 0 if the code point represented is well-formed Perl-extended-UTF-8, but
+     * disallowed by the flags.  If the input is only for a partial character,
+     * the function will return non-zero if there is any sequence of
+     * well-formed UTF-8 that, when appended to the input sequence, could
+     * result in an allowed code point; otherwise it returns 0.  Non-character
+     * code points cannot be determined from partial character input, but many
+     * of the other excluded types can be determined with just the first one
+     * or two bytes.
+     *
+     */
+
+    PERL_ARGS_ASSERT__IS_UTF8_CHAR_HELPER;
 
+    assert(0 == (flags & ~(UTF8_DISALLOW_ILLEGAL_INTERCHANGE
+                          |UTF8_DISALLOW_ABOVE_31_BIT)));
+    assert(! UTF8_IS_INVARIANT(*s));
+
+    /* A variant char must begin with a start byte */
     if (UNLIKELY(! UTF8_IS_START(*s))) {
         return 0;
     }
 
-    e = s + len;
+    len = e - s;
+
+    if (flags && isUTF8_POSSIBLY_PROBLEMATIC(*s)) {
+        const U8 s0 = NATIVE_UTF8_TO_I8(s[0]);
+
+        /* The code below is derived from this table.  Keep in mind that legal
+         * continuation bytes range between \x80..\xBF for UTF-8, and
+         * \xA0..\xBF for I8.  Anything above those isn't a continuation byte.
+         * Hence, we don't have to test the upper edge because if any of those
+         * is encountered, the sequence is malformed, and will fail elsewhere
+         * in this function.
+         *              UTF-8            UTF-EBCDIC I8
+         *   U+D800: \xED\xA0\x80      \xF1\xB6\xA0\xA0      First surrogate
+         *   U+DFFF: \xED\xBF\xBF      \xF1\xB7\xBF\xBF      Final surrogate
+         * U+110000: \xF4\x90\x80\x80  \xF9\xA2\xA0\xA0\xA0  First above Unicode
+         *
+         */
+
+#ifdef EBCDIC   /* On EBCDIC, these are actually I8 bytes */
+#  define FIRST_START_BYTE_THAT_IS_DEFINITELY_SUPER  0xFA
+#  define IS_SUPER_2_BYTE(s0, s1)                ((s0) == 0xF9 && (s1) >= 0xA2)
 
+                                                               /* B6 and B7 */
+#  define IS_SURROGATE(s0, s1)         ((s0) == 0xF1 && ((s1) & 0xFE ) == 0xB6)
+#else
+#  define FIRST_START_BYTE_THAT_IS_DEFINITELY_SUPER  0xF5
+#  define IS_SUPER_2_BYTE(s0, s1)                ((s0) == 0xF4 && (s1) >= 0x90)
+#  define IS_SURROGATE(s0, s1)                   ((s0) == 0xED && (s1) >= 0xA0)
+#endif
+
+        if (  (flags & UTF8_DISALLOW_SUPER)
+            && UNLIKELY(s0 >= FIRST_START_BYTE_THAT_IS_DEFINITELY_SUPER)) {
+            return 0;           /* Above Unicode */
+        }
+
+        if (   (flags & UTF8_DISALLOW_ABOVE_31_BIT)
+            &&  UNLIKELY(is_utf8_cp_above_31_bits(s, e)))
+        {
+            return 0;           /* Above 31 bits */
+        }
+
+        if (len > 1) {
+            const U8 s1 = NATIVE_UTF8_TO_I8(s[1]);
+
+            if (   (flags & UTF8_DISALLOW_SUPER)
+                &&  UNLIKELY(IS_SUPER_2_BYTE(s0, s1)))
+            {
+                return 0;       /* Above Unicode */
+            }
+
+            if (   (flags & UTF8_DISALLOW_SURROGATE)
+                &&  UNLIKELY(IS_SURROGATE(s0, s1)))
+            {
+                return 0;       /* Surrogate */
+            }
+
+            if (  (flags & UTF8_DISALLOW_NONCHAR)
+                && UNLIKELY(UTF8_IS_NONCHAR(s, e)))
+            {
+                return 0;       /* Noncharacter code point */
+            }
+        }
+    }
+
+    /* Make sure that all that follows are continuation bytes */
     for (x = s + 1; x < e; x++) {
         if (UNLIKELY(! UTF8_IS_CONTINUATION(*x))) {
             return 0;
@@ -555,9 +638,18 @@ Perl__is_utf8_char_slow(const U8 * const s, const STRLEN len)
         break;
     }
 
-    return len;
+    return UTF8SKIP(s);
 }
 
+#undef FIRST_START_BYTE_THAT_IS_DEFINITELY_SUPER
+#undef IS_SUPER_2_BYTE
+#undef IS_SURROGATE
+#undef F0_ABOVE_OVERLONG
+#undef F8_ABOVE_OVERLONG
+#undef FC_ABOVE_OVERLONG
+#undef FE_ABOVE_OVERLONG
+#undef FF_OVERLONG_PREFIX
+
 /*
 
 =for apidoc utf8n_to_uvchr
@@ -4048,7 +4140,7 @@ Perl_check_utf8_print(pTHX_ const U8* s, const STRLEN len)
        }
        if (UNLIKELY(isUTF8_POSSIBLY_PROBLEMATIC(*s))) {
            STRLEN char_len;
-           if (UTF8_IS_SUPER(s, e)) {
+           if (UNLIKELY(UTF8_IS_SUPER(s, e))) {
                 if (   ckWARN_d(WARN_NON_UNICODE)
                     || (   ckWARN_d(WARN_DEPRECATED)
 #ifndef UV_IS_QUAD
@@ -4071,7 +4163,7 @@ Perl_check_utf8_print(pTHX_ const U8* s, const STRLEN len)
                     ok = FALSE;
                 }
            }
-           else if (UTF8_IS_SURROGATE(s, e)) {
+           else if (UNLIKELY(UTF8_IS_SURROGATE(s, e))) {
                if (ckWARN_d(WARN_SURROGATE)) {
                     /* This has a different warning than the one the called
                      * function would output, so can't just call it, unlike we
@@ -4082,7 +4174,7 @@ Perl_check_utf8_print(pTHX_ const U8* s, const STRLEN len)
                    ok = FALSE;
                }
            }
-           else if ((UTF8_IS_NONCHAR(s, e)) && (ckWARN_d(WARN_NONCHAR))) {
+           else if (UNLIKELY(UTF8_IS_NONCHAR(s, e)) && (ckWARN_d(WARN_NONCHAR))) {
                 /* A side effect of this function will be to warn */
                 (void) utf8n_to_uvchr(s, e - s, &char_len, UTF8_WARN_NONCHAR);
                ok = FALSE;
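The early-rejection logic in the new helper relies on the table in its comment: on ASCII platforms, the first one or two bytes are enough to tell that a sequence must encode a surrogate or something above Unicode. A minimal sketch of that idea follows, assuming ASCII-platform UTF-8 only. The DEMO_ flag names and demo_prefix_is_disallowed are hypothetical stand-ins and do not carry the values of perl's UTF8_DISALLOW_foo flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits, for illustration only */
enum {
    DEMO_DISALLOW_SURROGATE = 1u << 0,
    DEMO_DISALLOW_SUPER     = 1u << 1
};

/* True if the first 'n' bytes already prove that the character being
 * encoded is excluded by 'flags'.  Non-ASCII (EBCDIC) platforms would need
 * the I8 values from the table instead of the ones used here. */
static bool demo_prefix_is_disallowed(const char *bytes, size_t n,
                                      unsigned flags)
{
    const unsigned char *s = (const unsigned char *) bytes;

    assert(bytes != NULL);

    if (n == 0) return false;

    /* Start bytes F5 and up can only encode code points above U+10FFFF */
    if ((flags & DEMO_DISALLOW_SUPER) && s[0] >= 0xF5) return true;

    if (n > 1) {
        /* U+110000 is F4 90 80 80, so F4 followed by 90 or higher is
         * above Unicode */
        if ((flags & DEMO_DISALLOW_SUPER) && s[0] == 0xF4 && s[1] >= 0x90)
            return true;

        /* U+D800 is ED A0 80, so ED followed by A0 or higher is a
         * surrogate */
        if ((flags & DEMO_DISALLOW_SURROGATE) && s[0] == 0xED && s[1] >= 0xA0)
            return true;
    }
    return false;
}
```

As in the helper, the upper edge of the second byte is not tested here: a second byte above BF is not a continuation byte, so a separate well-formedness pass would reject it anyway.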
diff --git a/utf8.h b/utf8.h
index 7202dc4..392a86a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -223,7 +223,11 @@ As you can see, the continuation bytes all begin with C<10>, and the
 leading bits of the start byte tell how many bytes there are in the
 encoded character.
 
-Perl's extended UTF-8 means we can have start bytes up to FF.
+Perl's extended UTF-8 means we can have start bytes up through FF, though any
+beginning with FF yields a code point that is too large for 32-bit ASCII
+platforms.  FF signals to use 13 bytes for the encoded character.  This breaks
+the paradigm that the number of leading bits gives how many total bytes there
+are in the character.
 
 */
 
@@ -330,6 +334,77 @@ C<cp> is Unicode if above 255; otherwise is platform-native.
 /* The above macro handles UTF-8 that has this start byte as the maximum */
 #define _IS_UTF8_CHAR_HIGHEST_START_BYTE 0xF7
 
+/* A helper macro for isSTRICT_UTF8_CHAR, so use that one instead of this.
+ * Like is_UTF8_CHAR_utf8_no_length_checks(), this was moved here and LIKELYs
+ * added manually.
+ *
+       STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points, no
+                          surrogates nor non-character code points
+*/
+/*** GENERATED CODE ***/
+#define is_STRICT_UTF8_CHAR_utf8_no_length_checks(s)                        \
+( ( 0xC2 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xDF ) ?                          \
+    ( LIKELY( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) ? 2 : 0 )                    \
+: ( 0xE0 == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ((U8*)s)[1] & 0xE0 ) == 0xA0 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
+: ( ( 0xE1 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xEC ) || 0xEE == ((U8*)s)[0] ) ?\
+    ( ( ( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
+: ( 0xED == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ((U8*)s)[1] & 0xE0 ) == 0x80 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
+: ( 0xEF == ((U8*)s)[0] ) ?                                                 \
+    ( ( ( 0x80 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0xB6 ) || ( 0xB8 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0xBE ) ) ?\
+       ( LIKELY( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ? 3 : 0 )                 \
+    : ( 0xB7 == ((U8*)s)[1] ) ?                                             \
+       ( LIKELY( ( ((U8*)s)[2] & 0xF0 ) == 0x80 || ( ((U8*)s)[2] & 0xF0 ) == 0xB0 ) ? 3 : 0 )\
+    : ( ( 0xBF == ((U8*)s)[1] ) && ( 0x80 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0xBD ) ) ? 3 : 0 )\
+: ( 0xF0 == ((U8*)s)[0] ) ?                                                 \
+    ( ( ( 0x90 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x9E ) || ( 0xA0 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0xAE ) || ( 0xB0 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0xBE ) ) ?\
+       ( LIKELY( ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) && ( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 : 0 )\
+    : ( ((U8*)s)[1] == 0x9F || ( ( ((U8*)s)[1] & 0xEF ) == 0xAF ) ) ?       \
+       ( ( 0x80 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0xBE ) ?                  \
+           ( LIKELY( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ? 4 : 0 )             \
+       : LIKELY( ( 0xBF == ((U8*)s)[2] ) && ( 0x80 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0xBD ) ) ? 4 : 0 )\
+    : 0 )                                                                   \
+: ( 0xF1 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xF3 ) ?                          \
+    ( ( ( ( ((U8*)s)[1] & 0xC8 ) == 0x80 ) || ( ( ((U8*)s)[1] & 0xCC ) == 0x88 ) || ( ( ((U8*)s)[1] & 0xCE ) == 0x8C ) || ( ( ((U8*)s)[1] & 0xCF ) == 0x8E ) ) ?\
+       ( LIKELY( ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) && ( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 : 0 )\
+    : ( ( ((U8*)s)[1] & 0xCF ) == 0x8F ) ?                                  \
+       ( ( 0x80 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0xBE ) ?                  \
+           ( LIKELY( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ? 4 : 0 )             \
+       : LIKELY( ( 0xBF == ((U8*)s)[2] ) && ( 0x80 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0xBD ) ) ? 4 : 0 )\
+    : 0 )                                                                   \
+: ( 0xF4 == ((U8*)s)[0] ) ?                                                 \
+    ( ( 0x80 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x8E ) ?                      \
+       ( LIKELY( ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) && ( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 : 0 )\
+    : ( 0x8F == ((U8*)s)[1] ) ?                                             \
+       ( ( 0x80 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0xBE ) ?                  \
+           ( LIKELY( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ? 4 : 0 )             \
+       : LIKELY( ( 0xBF == ((U8*)s)[2] ) && ( 0x80 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0xBD ) ) ? 4 : 0 )\
+    : 0 )                                                                   \
+: 0 )
+
+/*  Similarly,
+        C9_STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code
+                                     points, no surrogates
+       0x0080 - 0xD7FF
+       0xE000 - 0x10FFFF
+*/
+/*** GENERATED CODE ***/
+#define is_C9_STRICT_UTF8_CHAR_utf8_no_length_checks(s)                     \
+( ( 0xC2 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xDF ) ?                          \
+    ( LIKELY( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) ? 2 : 0 )                    \
+: ( 0xE0 == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ((U8*)s)[1] & 0xE0 ) == 0xA0 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
+: ( ( 0xE1 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xEC ) || ( ((U8*)s)[0] & 0xFE ) == 0xEE ) ?\
+    ( LIKELY( ( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
+: ( 0xED == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ((U8*)s)[1] & 0xE0 ) == 0x80 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
+: ( 0xF0 == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( 0x90 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0xBF ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) && ( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 : 0 )\
+: ( 0xF1 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xF3 ) ?                          \
+    ( LIKELY( ( ( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) && ( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 : 0 )\
+: LIKELY( ( ( ( 0xF4 == ((U8*)s)[0] ) && ( ( ((U8*)s)[1] & 0xF0 ) == 0x80 ) ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) && ( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 : 0 )
+
 #endif /* EBCDIC vs ASCII */
 
 /* 2**UTF_ACCUMULATION_SHIFT - 1 */
@@ -755,14 +830,14 @@ fit in an IV on the current machine.
                     && (    NATIVE_UTF8_TO_I8(*(s)) >  0xF9                 \
                         || (NATIVE_UTF8_TO_I8(*(s) + 1) >= 0xA2))           \
                     &&  LIKELY((s) + UTF8SKIP(s) <= (e)))                   \
-                    ? _is_utf8_char_slow(s, UTF8SKIP(s)) : 0)
+                    ? _is_utf8_char_helper(s, s + UTF8SKIP(s), 0) : 0)
 #else
 #   define UTF8_IS_SUPER(s, e)                                              \
                    ((    LIKELY((e) > (s) + 3)                              \
                      &&  (*(U8*) (s)) >= 0xF4                               \
                      && ((*(U8*) (s)) >  0xF4 || (*((U8*) (s) + 1) >= 0x90))\
                      &&  LIKELY((s) + UTF8SKIP(s) <= (e)))                  \
-                    ? _is_utf8_char_slow(s, UTF8SKIP(s)) : 0)
+                    ? _is_utf8_char_helper(s, s + UTF8SKIP(s), 0) : 0)
 #endif
 
 /* These are now machine generated, and the 'given' clause is no longer
@@ -881,17 +956,15 @@ point's representation.
 
 #define SHARP_S_SKIP 2
 
-/* If you want to exclude surrogates, and beyond legal Unicode, see the blame
- * log for earlier versions which gave details for these */
-
 /*
 
 =for apidoc Am|STRLEN|isUTF8_CHAR|const U8 *s|const U8 *e
 
 Evaluates to non-zero if the first few bytes of the string starting at C<s> and
-looking no further than S<C<e - 1>> are well-formed UTF-8 that represents some
-code point; otherwise it evaluates to 0.  If non-zero, the value gives how many
-many bytes starting at C<s> comprise the code point's representation.
+looking no further than S<C<e - 1>> are well-formed UTF-8, as extended by Perl,
+that represents some code point; otherwise it evaluates to 0.  If non-zero, the
+value gives how many bytes starting at C<s> comprise the code point's
+representation.
 
 The code point can be any that will fit in a UV on this machine, using Perl's
 extension to official UTF-8 to represent those higher than the Unicode maximum
@@ -914,15 +987,109 @@ is a valid UTF-8 character.
     (UNLIKELY((e) <= (s))                                                   \
     ? 0                                                                     \
     : (UTF8_IS_INVARIANT(*s))                                               \
-    ? 1                                                                     \
-    : UNLIKELY(((e) - (s)) < UTF8SKIP(s))                                   \
-      ? 0                                                                   \
-      : LIKELY(NATIVE_UTF8_TO_I8(*s) <= _IS_UTF8_CHAR_HIGHEST_START_BYTE)   \
-      ? is_UTF8_CHAR_utf8_no_length_checks(s)                               \
-      : _is_utf8_char_slow(s, UTF8SKIP(s)))
+      ? 1                                                                   \
+      : UNLIKELY(((e) - (s)) < UTF8SKIP(s))                                 \
+        ? 0                                                                 \
+        : LIKELY(NATIVE_UTF8_TO_I8(*s) <= _IS_UTF8_CHAR_HIGHEST_START_BYTE) \
+          ? is_UTF8_CHAR_utf8_no_length_checks(s)                           \
+          : _is_utf8_char_helper(s, e, 0))
 
 #define is_utf8_char_buf(buf, buf_end) isUTF8_CHAR(buf, buf_end)
 
+/*
+
+=for apidoc Am|STRLEN|isSTRICT_UTF8_CHAR|const U8 *s|const U8 *e
+
+Evaluates to non-zero if the first few bytes of the string starting at C<s> and
+looking no further than S<C<e - 1>> are well-formed UTF-8 that represents some
+Unicode code point completely acceptable for open interchange between all
+applications; otherwise it evaluates to 0.  If non-zero, the value gives how
+many bytes starting at C<s> comprise the code point's representation.
+
+The largest acceptable code point is the Unicode maximum 0x10FFFF, and must not
+be a surrogate nor a non-character code point.  Thus this excludes any code
+point from Perl's extended UTF-8.
+
+This is used to efficiently decide if the next few bytes in C<s> are
+legal Unicode-acceptable UTF-8 for a single character.  Use
+C<L</isC9_STRICT_UTF8_CHAR>> to also accept non-character code points.
+
+=cut
+*/
+
+#define isSTRICT_UTF8_CHAR(s, e)                                            \
+    (UNLIKELY((e) <= (s))                                                   \
+    ? 0                                                                     \
+    : (UTF8_IS_INVARIANT(*s))                                               \
+      ? 1                                                                   \
+      : UNLIKELY(((e) - (s)) < UTF8SKIP(s))                                 \
+        ? 0                                                                 \
+        : is_STRICT_UTF8_CHAR_utf8_no_length_checks(s))
+
+/*
+
+=for apidoc Am|STRLEN|isC9_STRICT_UTF8_CHAR|const U8 *s|const U8 *e
+
+Evaluates to non-zero if the first few bytes of the string starting at C<s> and
+looking no further than S<C<e - 1>> are well-formed UTF-8 that represents some
+Unicode non-surrogate code point; otherwise it evaluates to 0.  If non-zero,
+the value gives how many bytes starting at C<s> comprise the code point's
+representation.
+
+The largest acceptable code point is the Unicode maximum 0x10FFFF.  This
+differs from C<L</isSTRICT_UTF8_CHAR>> only in that it accepts non-character
+code points.  This corresponds to
+L<Unicode Corrigendum #9|http://www.unicode.org/versions/corrigendum9.html>,
+which said that non-character code points are merely discouraged rather than
+completely forbidden in open interchange.  See
+L<perlunicode/Noncharacter code points>.
+
+=cut
+*/
+
+#define isC9_STRICT_UTF8_CHAR(s, e)                                         \
+    (UNLIKELY((e) <= (s))                                                   \
+    ? 0                                                                     \
+    : (UTF8_IS_INVARIANT(*s))                                               \
+      ? 1                                                                   \
+      : UNLIKELY(((e) - (s)) < UTF8SKIP(s))                                 \
+        ? 0                                                                 \
+        : is_C9_STRICT_UTF8_CHAR_utf8_no_length_checks(s))
+
+/*
+
+=for apidoc Am|STRLEN|isUTF8_CHAR_flags|const U8 *s|const U8 *e| const U32 flags
+
+Evaluates to non-zero if the first few bytes of the string starting at C<s> and
+looking no further than S<C<e - 1>> are well-formed UTF-8, as extended by Perl,
+that represents some code point, subject to the restrictions given by C<flags>;
+otherwise it evaluates to 0.  If non-zero, the value gives how many bytes
+starting at C<s> comprise the code point's representation.
+
+If C<flags> is 0, this gives the same results as C<L</isUTF8_CHAR>>;
+if C<flags> is C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>, this gives the same results
+as C<L</isSTRICT_UTF8_CHAR>>;
+and if C<flags> is C<UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE>, this gives
+the same results as C<L</isC9_STRICT_UTF8_CHAR>>.
+Otherwise C<flags> may be any combination of the C<UTF8_DISALLOW_I<foo>> flags
+understood by C<L</utf8n_to_uvchr>>, with the same meanings.
+
+The three alternative macros are for the most commonly needed validations; they
+are likely to run somewhat faster than this more general one, as they can be
+inlined into your code.
+
+=cut
+*/
+
+#define isUTF8_CHAR_flags(s, e, flags)                                      \
+    (UNLIKELY((e) <= (s))                                                   \
+    ? 0                                                                     \
+    : (UTF8_IS_INVARIANT(*s))                                               \
+      ? 1                                                                   \
+      : UNLIKELY(((e) - (s)) < UTF8SKIP(s))                                 \
+        ? 0                                                                 \
+        : _is_utf8_char_helper(s, e, flags))
+
 /* Do not use; should be deprecated.  Use isUTF8_CHAR() instead; this is
  * retained solely for backwards compatibility */
 #define IS_UTF8_CHAR(p, n)      (isUTF8_CHAR(p, (p) + (n)) == n)
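The isUTF8_CHAR-style macros added in utf8.h all share one shape: an empty range gives 0, an invariant gives 1, too few bytes for the start byte gives 0, and otherwise a full validation returns the character's length. A rough standalone sketch of that shape, and of how a caller can walk a buffer with it, is given below. It assumes ASCII-platform standard UTF-8 with the Corrigendum #9 rules (surrogates rejected, non-characters accepted) and decodes the code point instead of using the generated trie. The demo_ functions are hypothetical, not the perl API.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes implied by a start byte; 0 if it cannot start a char */
static size_t demo_skip_len(unsigned char b)
{
    if (b < 0x80)               return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Returns the byte length of the first character in [s, e), or 0 if those
 * bytes are not a complete, acceptable character. */
static size_t demo_c9_char_len(const unsigned char *s, const unsigned char *e)
{
    /* Smallest code point that legitimately needs n bytes */
    static const unsigned long min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long cp;
    size_t n, i;

    assert(s != NULL && e != NULL);

    if (e <= s)      return 0;            /* empty range */
    if (s[0] < 0x80) return 1;            /* invariant */

    n = demo_skip_len(s[0]);
    if (n == 0 || (size_t) (e - s) < n) return 0;   /* bad start or short */

    cp = s[0] & (0xFFu >> (n + 1));       /* payload bits of the start byte */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min_cp[n])                return 0;    /* overlong */
    if (cp > 0x10FFFF)                 return 0;    /* above Unicode */
    if (cp >= 0xD800 && cp <= 0xDFFF)  return 0;    /* surrogate */
    return n;
}

/* Counts the characters in a buffer, or returns -1 at the first failure */
static long demo_count_chars(const char *bytes, size_t n)
{
    const unsigned char *s = (const unsigned char *) bytes;
    const unsigned char *e = s + n;
    long count = 0;

    while (s < e) {
        size_t len = demo_c9_char_len(s, e);
        if (len == 0) return -1;
        s += len;
        count++;
    }
    return count;
}
```

Because a non-zero return is the character's length, the same call both validates and advances, which is the usage the macros' return value is designed for.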
diff --git a/utfebcdic.h b/utfebcdic.h
index 7d37fbc..fd247af 100644
--- a/utfebcdic.h
+++ b/utfebcdic.h
@@ -275,19 +275,21 @@ explicitly forbidden, and the shortest possible encoding should always be used
 #   define HIGHEST_REPRESENTABLE_UTF8  "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA3\xBF\xBF\xBF\xBF\xBF\xBF"
 #endif
 
-/* A helper macro for isUTF8_CHAR, so use that one instead of this.  This was
- * generated by regen/regcharclass.pl, and then moved here.  Then it was
+/* Helper macros for isUTF8_CHAR_foo, so use those instead of this.  These were
+ * generated by regen/regcharclass.pl, and then moved here.  Then they were
  * hand-edited to add some LIKELY() calls, presuming that malformations are
  * unlikely.  The lines that generated it were then commented out.  This was
  * done because it takes on the order of 10 minutes to generate, and is never
- * going to change, unless the generated code is improved, and figuring out
- * the LIKELYs there would be hard.
+ * going to change, unless the generated code is improved, and figuring out the
+ * LIKELYs there would be hard.
  *
-        UTF8_CHAR: Matches legal UTF-EBCDIC variant code points up through 0x1FFFFFF
+ */
+
+#if '^' == 95 /* CP 1047 */
+/*      UTF8_CHAR: Matches legal UTF-EBCDIC variant code points up through 0x1FFFFFF
 
        0xA0 - 0x1FFFFF
 */
-#if '^' == 95 /* CP 1047 */
 
 /*** GENERATED CODE ***/
 #define is_UTF8_CHAR_utf8_no_length_checks(s)                               \
@@ -303,6 +305,93 @@ explicitly forbidden, and the shortest possible encoding should always be used
     ( LIKELY( ( ( ( ( 0x49 == ((U8*)s)[1] || 0x4A == ((U8*)s)[1] ) || ( 0x51 
<= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] 
<= 0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x7 ... [584 chars truncated]
 : ( ( ( ( ( 0xEE == ((U8*)s)[0] ) && LIKELY( ( 0x41 <= ((U8*)s)[1] && 
((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 
0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( (( ... [628 chars truncated]
 
+/*      UTF8_CHAR_STRICT: Matches legal Unicode UTF-8 variant code points, no
+                          surrogates nor non-character code points */
+/*** GENERATED CODE ***/
+#define is_STRICT_UTF8_CHAR_utf8_no_length_checks_part0(s)                  \
+( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && 
((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 
0x70 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x72  ... [6 chars truncated]
+       ( LIKELY( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || ( 0x62 <= ((U8*)s)[2] && ((U8*)s)[2] <= 
0x6A ) || ( ((U8*)s)[2] & 0xFC ) == 0x70 ) &&  ... [197 chars truncated]
+    : ( 0x73 == ((U8*)s)[1] ) ?                                             \
+       ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || ( 0x62 <= ((U8*)s)[2] && ((U8*)s)[2] <= 
0x6A ) || ( 0x70 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x72 ... [7 chars truncated]
+           ( LIKELY( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || ( 0x62 <= ((U8*)s)[3] && ((U8*)s)[3] 
<= 0x6A ) || ( ((U8*)s)[3] & 0xFC ) == 0x70 ) ? ... [9 chars truncated]
+       : LIKELY( ( 0x73 == ((U8*)s)[2] ) && ( ( 0x41 <= ((U8*)s)[3] && 
((U8*)s)[3] <= 0x4A ) || ( 0x51 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || ( 
0x62 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( ((U8*)s ... [36 chars 
truncated]
+    : 0 )
+
+
+/*** GENERATED CODE ***/
+#define is_STRICT_UTF8_CHAR_utf8_no_length_checks_part1(s)                  \
+( ( 0xED == ((U8*)s)[0] ) ?                                                 \
+    ( ( ( ( ((U8*)s)[1] & 0xEF ) == 0x49 ) || ( ( ((U8*)s)[1] & 0xF9 ) == 0x51 
) || ((U8*)s)[1] == 0x63 || ( ( ((U8*)s)[1] & 0xFD ) == 0x65 ) || ((U8*)s)[1] 
== 0x69 || ( ( ((U8*)s)[1] & 0xFD ) == 0x7 ... [8 chars truncated]
+       ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || ( 0x62 <= ((U8*)s)[2] && ((U8*)s)[2] 
<= 0x6A ) || ( ((U8*)s)[2] & 0xFC ) == 0x70 ) & ... [389 chars truncated]
+    : ( ((U8*)s)[1] == 0x4A || ((U8*)s)[1] == 0x52 || ( ( ((U8*)s)[1] & 0xFD ) 
== 0x54 ) || ((U8*)s)[1] == 0x58 || ((U8*)s)[1] == 0x62 || ( ( ((U8*)s)[1] & 
0xFD ) == 0x64 ) || ( ( ((U8*)s)[1] & 0xFD  ... [54 chars truncated]
+       ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || ( 0x62 <= ((U8*)s)[2] && ((U8*)s)[2] <= 
0x6A ) || ( 0x70 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x72 ... [7 chars truncated]
+           ( LIKELY( ( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 
0x51 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || ( 0x62 <= ((U8*)s)[3] && 
((U8*)s)[3] <= 0x6A ) || ( ((U8*)s)[3] & 0xFC ) == 0x70 ) ... [201 chars 
truncated]
+       : ( 0x73 == ((U8*)s)[2] ) ?                                         \
+           ( ( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || ( 0x62 <= ((U8*)s)[3] && ((U8*)s)[3] <= 
0x6A ) || ( 0x70 <= ((U8*)s)[3] && ((U8*)s)[3] <=  ... [11 chars truncated]
+               ( LIKELY( ( 0x41 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x4A ) || ( 
0x51 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x59 ) || ( 0x62 <= ((U8*)s)[4] && 
((U8*)s)[4] <= 0x6A ) || ( ((U8*)s)[4] & 0xFC ) == 0x70 ) ? 5  ... [6 chars 
truncated]
+           : LIKELY( ( 0x73 == ((U8*)s)[3] ) && ( ( 0x41 <= ((U8*)s)[4] && 
((U8*)s)[4] <= 0x4A ) || ( 0x51 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x59 ) || ( 
0x62 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x6A ) || ( ((U ... [40 chars truncated]
+       : 0 )                                                               \
+    : 0 )                                                                   \
+: ( 0xEE == ((U8*)s)[0] ) ?                                                 \
+    ( ( 0x41 == ((U8*)s)[1] ) ?                                             \
+       ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || ( 0x62 <= ((U8*)s)[2] && ((U8*)s)[2] 
<= 0x6A ) || ( ((U8*)s)[2] & 0xFC ) == 0x70 ) & ... [389 chars truncated]
+    : ( 0x42 == ((U8*)s)[1] ) ?                                             \
+       ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || ( 0x62 <= ((U8*)s)[2] && ((U8*)s)[2] <= 
0x6A ) || ( 0x70 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x72 ... [7 chars truncated]
+           ( LIKELY( ( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 
0x51 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || ( 0x62 <= ((U8*)s)[3] && 
((U8*)s)[3] <= 0x6A ) || ( ((U8*)s)[3] & 0xFC ) == 0x70 ) ... [201 chars 
truncated]
+       : ( 0x73 == ((U8*)s)[2] ) ?                                         \
+           ( ( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || ( 0x62 <= ((U8*)s)[3] && ((U8*)s)[3] <= 
0x6A ) || ( 0x70 <= ((U8*)s)[3] && ((U8*)s)[3] <=  ... [11 chars truncated]
+               ( LIKELY( ( 0x41 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x4A ) || ( 
0x51 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x59 ) || ( 0x62 <= ((U8*)s)[4] && 
((U8*)s)[4] <= 0x6A ) || ( ((U8*)s)[4] & 0xFC ) == 0x70 ) ? 5  ... [6 chars 
truncated]
+           : LIKELY( ( 0x73 == ((U8*)s)[3] ) && ( ( 0x41 <= ((U8*)s)[4] && 
((U8*)s)[4] <= 0x4A ) || ( 0x51 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x59 ) || ( 
0x62 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x6A ) || ( ((U ... [40 chars truncated]
+       : 0 )                                                               \
+    : 0 )                                                                   \
+: 0 )
+
+
+/*** GENERATED CODE ***/
+#define is_STRICT_UTF8_CHAR_utf8_no_length_checks(s)                        \
+( ( 0x80 == ((U8*)s)[0] || ( 0x8A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0x90 ) || ( 
0x9A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xA0 ) || ( 0xAA <= ((U8*)s)[0] && 
((U8*)s)[0] <= 0xAC ) || ( 0xAE <= ((U8*)s)[0]  ... [29 chars truncated]
+    ( LIKELY( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 
0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 ) ?  ... [8 chars truncated]
+: ( ( ( ((U8*)s)[0] & 0xFC ) == 0xB8 ) || ((U8*)s)[0] == 0xBC || ( ( 
((U8*)s)[0] & 0xFE ) == 0xBE ) || ( ( ((U8*)s)[0] & 0xEE ) == 0xCA ) || ( ( 
((U8*)s)[0] & 0xFC ) == 0xCC ) ) ?\
+    ( LIKELY( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 
0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 )  ... [200 chars truncated]
+: ( 0xDC == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( 0x57 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 ) && ( ( 
0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4 ... [342 chars truncated]
+: ( 0xDD == ((U8*)s)[0] ) ?                                                 \
+    ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 
0x64 ) || ( 0x67 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0 ... [60 chars truncated]
+       ( LIKELY( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || ( 0x62 <= ((U8*)s)[2] && ((U8*)s)[2] <= 
0x6A ) || ( ((U8*)s)[2] & 0xFC ) == 0x70 ) &&  ... [197 chars truncated]
+    : ( 0x73 == ((U8*)s)[1] ) ?                                             \
+       ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x54 ) || ( 0x57 <= ((U8*)s)[2] && ((U8*)s)[2] <= 
0x59 ) || ( 0x62 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x6A ... [57 chars truncated]
+           ( LIKELY( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || ( 0x62 <= ((U8*)s)[3] && ((U8*)s)[3] 
<= 0x6A ) || ( ((U8*)s)[3] & 0xFC ) == 0x70 ) ? ... [9 chars truncated]
+       : ( 0x55 == ((U8*)s)[2] ) ?                                         \
+           ( LIKELY( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[3] && ((U8*)s)[3] <= 0x56 ) ) ? 4 : 0 )\
+       : ( 0x56 == ((U8*)s)[2] ) ?                                         \
+           ( LIKELY( ( 0x57 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || ( 0x62 
<= ((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( ((U8*)s)[3] & 0xFC ) == 0x70 ) ? 4 
: 0 )\
+       : LIKELY( ( 0x73 == ((U8*)s)[2] ) && ( ( 0x41 <= ((U8*)s)[3] && 
((U8*)s)[3] <= 0x4A ) || ( 0x51 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || ( 
0x62 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( ((U8*)s ... [36 chars 
truncated]
+    : 0 )                                                                   \
+: ( 0xDE == ((U8*)s)[0] || 0xE1 == ((U8*)s)[0] || 0xEB == ((U8*)s)[0] ) ?   \
+    ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 
0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70  ... [392 chars truncated]
+: ( 0xDF == ((U8*)s)[0] || 0xEA == ((U8*)s)[0] || 0xEC == ((U8*)s)[0] ) ? is_STRICT_UTF8_CHAR_utf8_no_length_checks_part0(s) : is_STRICT_UTF8_CHAR_utf8_no_length_checks_part1(s) )
+
+/*      C9_STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points
+                             including non-character code points, no surrogates
+       0x00A0 - 0xD7FF
+       0xE000 - 0x10FFFF
+*/
+/*** GENERATED CODE ***/
+#define is_C9_STRICT_UTF8_CHAR_utf8_no_length_checks(s)             \
+( ( 0x80 == ((U8*)s)[0] || ( 0x8A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0x90 ) || ( 
0x9A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xA0 ) || ( 0xAA <= ((U8*)s)[0] && 
((U8*)s)[0] <= 0xAC ) || ( 0xAE <= ((U8*)s)[0]  ... [29 chars truncated]
+    ( LIKELY( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 
0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 ) ?  ... [8 chars truncated]
+: ( ( ( ((U8*)s)[0] & 0xFC ) == 0xB8 ) || ((U8*)s)[0] == 0xBC || ( ( 
((U8*)s)[0] & 0xFE ) == 0xBE ) || ( ( ((U8*)s)[0] & 0xEE ) == 0xCA ) || ( ( 
((U8*)s)[0] & 0xFC ) == 0xCC ) ) ?\
+    ( LIKELY( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 
0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 )  ... [200 chars truncated]
+: ( 0xDC == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( 0x57 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 ) && ( ( 
0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4 ... [342 chars truncated]
+: ( 0xDD == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 
0x64 ) || ( 0x67 <= ((U8*)s)[1] && ((U8*) ... [442 chars truncated]
+: ( ( ((U8*)s)[0] & 0xFE ) == 0xDE || 0xE1 == ((U8*)s)[0] || ( 0xEA <= 
((U8*)s)[0] && ((U8*)s)[0] <= 0xEC ) ) ?\
+    ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 
0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70  ... [392 chars truncated]
+: ( 0xED == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( ( 0x49 == ((U8*)s)[1] || 0x4A == ((U8*)s)[1] ) || ( 0x51 
<= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] 
<= 0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x7 ... [584 chars truncated]
+: LIKELY( ( ( ( ( 0xEE == ((U8*)s)[0] ) && ( 0x41 == ((U8*)s)[1] || 0x42 == 
((U8*)s)[1] ) ) && ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || (  ... [472 chars truncated]
+
 #endif
 
 #if '^' == 176 /* CP 037 */
@@ -321,10 +410,95 @@ explicitly forbidden, and the shortest possible encoding should always be used
     ( LIKELY( ( ( ( ( 0x49 == ((U8*)s)[1] || 0x4A == ((U8*)s)[1] ) || ( 0x51 
<= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 ... [740 chars truncated]
 : ( ( ( ( ( 0xEE == ((U8*)s)[0] ) && LIKELY( ( 0x41 <= ((U8*)s)[1] && 
((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F 
== ((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && ((U8*) ... [784 chars truncated]
 
+/*** GENERATED CODE ***/
+#define is_STRICT_UTF8_CHAR_utf8_no_length_checks_part0(s)                  \
+( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && 
((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && 
((U8*)s)[1] <= 0x6A ) || ( ((U8*)s)[1] & 0xFE ) ... [13 chars truncated]
+       ( LIKELY( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || 0x5F == ((U8*)s)[2] || ( 0x62 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x6A ) || ( 0x70 <= ((U ... [275 chars truncated]
+    : ( 0x72 == ((U8*)s)[1] ) ?                                             \
+       ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || 0x5F == ((U8*)s)[2] || ( 0x62 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x6A ) || ( ((U8*)s)[2] & 0xFE  ... [14 chars 
truncated]
+           ( LIKELY( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || 0x5F == ((U8*)s)[3] || ( 0x62 <= 
((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( 0x70 <= ( ... [48 chars truncated]
+       : LIKELY( ( 0x72 == ((U8*)s)[2] ) && ( ( 0x41 <= ((U8*)s)[3] && 
((U8*)s)[3] <= 0x4A ) || ( 0x51 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || 0x5F 
== ((U8*)s)[3] || ( 0x62 <= ((U8*)s)[3] && ((U8*)s)[3] ... [48 chars truncated]
+    : 0 )
+
+
+/*** GENERATED CODE ***/
+#define is_STRICT_UTF8_CHAR_utf8_no_length_checks_part1(s)                  \
+( ( 0xED == ((U8*)s)[0] ) ?                                                 \
+    ( ( ( ( ((U8*)s)[1] & 0xEF ) == 0x49 ) || ( ( ((U8*)s)[1] & 0xF9 ) == 0x51 
) || ((U8*)s)[1] == 0x62 || ( ( ((U8*)s)[1] & 0xFD ) == 0x64 ) || ( ( 
((U8*)s)[1] & 0xFD ) == 0x68 ) || ((U8*)s)[1] == 0 ... [8 chars truncated]
+       ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || 0x5F == ((U8*)s)[2] || ( 0x62 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x6A ) || ( 0x70 <= ( ... [506 chars truncated]
+    : ( ((U8*)s)[1] == 0x4A || ((U8*)s)[1] == 0x52 || ( ( ((U8*)s)[1] & 0xFD ) 
== 0x54 ) || ((U8*)s)[1] == 0x58 || ((U8*)s)[1] == 0x5F || ((U8*)s)[1] == 0x63 
|| ( ( ((U8*)s)[1] & 0xFD ) == 0x65 ) ||  ... [62 chars truncated]
+       ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || 0x5F == ((U8*)s)[2] || ( 0x62 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x6A ) || ( ((U8*)s)[2] & 0xFE  ... [14 chars 
truncated]
+           ( LIKELY( ( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 
0x51 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || 0x5F == ((U8*)s)[3] || ( 0x62 
<= ((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( 0x70 <= ... [279 chars truncated]
+       : ( 0x72 == ((U8*)s)[2] ) ?                                         \
+           ( ( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || 0x5F == ((U8*)s)[3] || ( 0x62 <= 
((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( ((U8*)s)[3] & 0 ... [18 chars 
truncated]
+               ( LIKELY( ( 0x41 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x4A ) || ( 
0x51 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x59 ) || 0x5F == ((U8*)s)[4] || ( 0x62 
<= ((U8*)s)[4] && ((U8*)s)[4] <= 0x6A ) || ( 0x70 <= ((U8 ... [45 chars 
truncated]
+           : LIKELY( ( 0x72 == ((U8*)s)[3] ) && ( ( 0x41 <= ((U8*)s)[4] && 
((U8*)s)[4] <= 0x4A ) || ( 0x51 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x59 ) || 0x5F 
== ((U8*)s)[4] || ( 0x62 <= ((U8*)s)[4] && ((U8*)s ... [52 chars truncated]
+       : 0 )                                                               \
+    : 0 )                                                                   \
+: ( 0xEE == ((U8*)s)[0] ) ?                                                 \
+    ( ( 0x41 == ((U8*)s)[1] ) ?                                             \
+       ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || 0x5F == ((U8*)s)[2] || ( 0x62 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x6A ) || ( 0x70 <= ( ... [506 chars truncated]
+    : ( 0x42 == ((U8*)s)[1] ) ?                                             \
+       ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || 0x5F == ((U8*)s)[2] || ( 0x62 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x6A ) || ( ((U8*)s)[2] & 0xFE  ... [14 chars 
truncated]
+           ( LIKELY( ( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 
0x51 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || 0x5F == ((U8*)s)[3] || ( 0x62 
<= ((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( 0x70 <= ... [279 chars truncated]
+       : ( 0x72 == ((U8*)s)[2] ) ?                                         \
+           ( ( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || 0x5F == ((U8*)s)[3] || ( 0x62 <= 
((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( ((U8*)s)[3] & 0 ... [18 chars 
truncated]
+               ( LIKELY( ( 0x41 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x4A ) || ( 
0x51 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x59 ) || 0x5F == ((U8*)s)[4] || ( 0x62 
<= ((U8*)s)[4] && ((U8*)s)[4] <= 0x6A ) || ( 0x70 <= ((U8 ... [45 chars 
truncated]
+           : LIKELY( ( 0x72 == ((U8*)s)[3] ) && ( ( 0x41 <= ((U8*)s)[4] && 
((U8*)s)[4] <= 0x4A ) || ( 0x51 <= ((U8*)s)[4] && ((U8*)s)[4] <= 0x59 ) || 0x5F 
== ((U8*)s)[4] || ( 0x62 <= ((U8*)s)[4] && ((U8*)s ... [52 chars truncated]
+       : 0 )                                                               \
+    : 0 )                                                                   \
+: 0 )
+
+
+/*** GENERATED CODE ***/
+#define is_STRICT_UTF8_CHAR_utf8_no_length_checks(s)                        \
+( ( 0x78 == ((U8*)s)[0] || 0x80 == ((U8*)s)[0] || ( 0x8A <= ((U8*)s)[0] && 
((U8*)s)[0] <= 0x90 ) || ( 0x9A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xA0 ) || ( 
0xAA <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xAF ) || ... [52 chars truncated]
+    ( LIKELY( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <= (( ... [47 chars truncated]
+: ( ((U8*)s)[0] == 0xB7 || ( ( ((U8*)s)[0] & 0xFE ) == 0xB8 ) || ( ( 
((U8*)s)[0] & 0xFC ) == 0xBC ) || ( ( ((U8*)s)[0] & 0xEE ) == 0xCA ) || ( ( 
((U8*)s)[0] & 0xFC ) == 0xCC ) ) ?\
+    ( LIKELY( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <=  ... [278 chars truncated]
+: ( 0xDC == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( 0x57 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == 
((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x72 ) ) && ( ( 0x ... [459 chars truncated]
+: ( 0xDD == ((U8*)s)[0] ) ?                                                 \
+    ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( ((U8*)s)[1] & 
0xFE ) == 0x62 || ( 0x66 <= ((U8*)s)[1] && ((U8*)s)[ ... [51 chars truncated]
+       ( LIKELY( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || 0x5F == ((U8*)s)[2] || ( 0x62 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x6A ) || ( 0x70 <= ((U ... [275 chars truncated]
+    : ( 0x72 == ((U8*)s)[1] ) ?                                             \
+       ( ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[2] && ((U8*)s)[2] <= 0x54 ) || ( 0x57 <= ((U8*)s)[2] && ((U8*)s)[2] <= 
0x59 ) || 0x5F == ((U8*)s)[2] || ( 0x62 <= ((U8*)s)[2] ... [64 chars truncated]
+           ( LIKELY( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || 0x5F == ((U8*)s)[3] || ( 0x62 <= 
((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( 0x70 <= ( ... [48 chars truncated]
+       : ( 0x55 == ((U8*)s)[2] ) ?                                         \
+           ( LIKELY( ( 0x41 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[3] && ((U8*)s)[3] <= 0x56 ) ) ? 4 : 0 )\
+       : ( 0x56 == ((U8*)s)[2] ) ?                                         \
+           ( LIKELY( ( 0x57 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || 0x5F == 
((U8*)s)[3] || ( 0x62 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x6A ) || ( 0x70 <= 
((U8*)s)[3] && ((U8*)s)[3] <= 0x72 ) ) ? 4 : 0 )\
+       : LIKELY( ( 0x72 == ((U8*)s)[2] ) && ( ( 0x41 <= ((U8*)s)[3] && 
((U8*)s)[3] <= 0x4A ) || ( 0x51 <= ((U8*)s)[3] && ((U8*)s)[3] <= 0x59 ) || 0x5F 
== ((U8*)s)[3] || ( 0x62 <= ((U8*)s)[3] && ((U8*)s)[3] ... [48 chars truncated]
+    : 0 )                                                                   \
+: ( 0xDE == ((U8*)s)[0] || 0xE1 == ((U8*)s)[0] || 0xEB == ((U8*)s)[0] ) ?   \
+    ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 < ... [509 chars truncated]
+: ( 0xDF == ((U8*)s)[0] || 0xEA == ((U8*)s)[0] || 0xEC == ((U8*)s)[0] ) ? is_STRICT_UTF8_CHAR_utf8_no_length_checks_part0(s) : is_STRICT_UTF8_CHAR_utf8_no_length_checks_part1(s) )
+
+/*      C9_STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points
+                             including non-character code points, no surrogates
+       0x00A0 - 0xD7FF
+       0xE000 - 0x10FFFF
+*/
+/*** GENERATED CODE ***/
+#define is_C9_STRICT_UTF8_CHAR_utf8_no_length_checks(s)             \
+( ( 0x78 == ((U8*)s)[0] || 0x80 == ((U8*)s)[0] || ( 0x8A <= ((U8*)s)[0] && 
((U8*)s)[0] <= 0x90 ) || ( 0x9A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xA0 ) || ( 
0xAA <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xAF ) || ... [52 chars truncated]
+    ( LIKELY( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <= (( ... [47 chars truncated]
+: ( ((U8*)s)[0] == 0xB7 || ( ( ((U8*)s)[0] & 0xFE ) == 0xB8 ) || ( ( 
((U8*)s)[0] & 0xFC ) == 0xBC ) || ( ( ((U8*)s)[0] & 0xEE ) == 0xCA ) || ( ( 
((U8*)s)[0] & 0xFC ) == 0xCC ) ) ?\
+    ( LIKELY( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <=  ... [278 chars truncated]
+: ( 0xDC == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( 0x57 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == 
((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x72 ) ) && ( ( 0x ... [459 chars truncated]
+: ( 0xDD == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( ((U8*)s)[1] & 
0xFE ) == 0x62 || ( 0x66 <= ((U8*)s)[1] && ... [543 chars truncated]
+: ( ( ((U8*)s)[0] & 0xFE ) == 0xDE || 0xE1 == ((U8*)s)[0] || ( 0xEA <= 
((U8*)s)[0] && ((U8*)s)[0] <= 0xEC ) ) ?\
+    ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 < ... [509 chars truncated]
+: ( 0xED == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( ( 0x49 == ((U8*)s)[1] || 0x4A == ((U8*)s)[1] ) || ( 0x51 
<= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 ... [740 chars truncated]
+: LIKELY( ( ( ( ( 0xEE == ((U8*)s)[0] ) && ( 0x41 == ((U8*)s)[1] || 0x42 == 
((U8*)s)[1] ) ) && ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 0x51 
<= ((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ) || 0x ... [589 chars truncated]
+
 #endif
 
-/* The above macro in both code pages handles UTF-8 that has this start byte
- * (expressed in I8) as the maximum */
+/* is_UTF8_CHAR_utf8_no_length_checks() in both code pages handles UTF-8 that
+ * has this start byte (expressed in I8) as the maximum */
 #define _IS_UTF8_CHAR_HIGHEST_START_BYTE 0xF9
 
 /*
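
The generated macros above work on UTF-EBCDIC bytes, which makes it hard to
see which code points each strictness level accepts. As a rough illustration
only, the two stricter levels described in the comments (STRICT: no
surrogates and no non-characters; C9_STRICT: no surrogates, non-characters
allowed) can be sketched on already-decoded code points. The function names
below are hypothetical and not part of perl's API, and the sketch assumes the
standard Unicode definitions of surrogates (U+D800..U+DFFF) and
non-characters (U+FDD0..U+FDEF plus every code point ending in FFFE or FFFF).
It does no byte-level validation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; perl's real checks operate on
 * encoded bytes via the generated macros, not on decoded code points. */

/* True for UTF-16 surrogate code points, U+D800..U+DFFF. */
static bool cp_is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* True for Unicode non-characters: U+FDD0..U+FDEF, and the last two
 * code points of every plane (xxFFFE and xxFFFF) up to U+10FFFF. */
static bool cp_is_nonchar(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

/* C9_STRICT level: any Unicode code point except surrogates. */
static bool cp_ok_c9_strict(uint32_t cp)
{
    return cp <= 0x10FFFF && !cp_is_surrogate(cp);
}

/* STRICT level: additionally rejects non-characters. */
static bool cp_ok_strict(uint32_t cp)
{
    return cp_ok_c9_strict(cp) && !cp_is_nonchar(cp);
}
```

Under these assumptions U+FFFE passes cp_ok_c9_strict() but fails
cp_ok_strict(), while U+D800 fails both.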

--
Perl5 Master Repository
