In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/aff2be59e4b3bbb62744c92140fb0be47ae32d1c?hp=a0b61ef95c1627542742a80907516f7ab89aaed6>

- Log -----------------------------------------------------------------
commit aff2be59e4b3bbb62744c92140fb0be47ae32d1c
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Oct 12 19:14:13 2016 -0600

    perlapi: Fix clause that should have been removed earlier

M       utf8.c

commit f4caf2b26940946566282dc266c4d4964d38d036
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Oct 12 19:12:11 2016 -0600

    utf8.h: Simplify macro
    
    This complicated macro boils down to just one bit.

M       utf8.h

commit f9380377ac81dfbd87f997094a742406eb899a15
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Oct 10 21:18:37 2016 -0600

    Add utf8n_to_uvchr_error
    
    This new function behaves like utf8n_to_uvchr(), but takes an extra
    parameter that points to a U32 which will be set to 0 if no errors are
    found; otherwise each error found will set a bit in it.  This can be
    used by the caller to figure out precisely what the error(s) is/are.
    Previously, one would have to capture and parse the warning/error
    messages raised.   This can be used, for example, to customize the
    messages to the expected end-user's knowledge level.

M       embed.fnc
M       embed.h
M       ext/XS-APItest/APItest.xs
M       ext/XS-APItest/t/utf8.t
M       pod/perldelta.pod
M       proto.h
M       utf8.c
M       utf8.h

commit 3757e55dd9611aede000f6f036237c4fda19b4f6
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Oct 8 21:19:18 2016 -0600

    utf8n_to_uvchr(): Make a parameter const

M       embed.fnc
M       proto.h
M       utf8.c

commit 2b5e7bc2e60b4c4b5d87aa66e066363d9dce7930
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Oct 5 19:09:02 2016 -0600

    utf8n_to_uvchr(): Note multiple malformations
    
    Some UTF-8 sequences can have multiple malformations.  For example, a
    sequence can be the start of an overlong representation of a code point,
    and still be incomplete.  Until this commit what was generally done was
    to stop looking when the first malformation was found.  This was not
    correct behavior, as that malformation may be allowed, while another
    unallowed one went unnoticed.  (But this did not actually create
    security holes, as those allowed malformations replaced the input with a
    REPLACEMENT CHARACTER.)  This commit refactors the error handling of
    this function to set a flag and keep going if a malformation is found
    that doesn't preclude others.  Then each is handled in a loop at the
    end, warning if warranted.  The result is that there is a warning for
    each malformation for which warnings should be generated, and an error
    return is made if any one is disallowed.
    
    Overflow doesn't happen except for very high code points, well above the
    Unicode range, and above fitting in 31 bits.  Hence the latter 2
    potential malformations are subsets of overflow, so only one warning is
    output--the most dire.
    
    This will speed up the normal case slightly, as the test for overflow is
    pulled out of the loop, allowing the UV to overflow.  Then a single test
    after the loop is done to see if there was overflow or not.

M       ext/XS-APItest/t/utf8.t
M       pod/perldelta.pod
M       pod/perldiag.pod
M       t/op/utf8decode.t
M       utf8.c
M       utf8.h

commit 1980a0f48b7a9b6e99cda0d5ae69cbb49da3cbf4
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Oct 7 15:07:57 2016 -0600

    APItest/t/utf8.t: Indent a bunch of code
    
    And reflow to fit in 80 columns.  This is in preparation for the next
    commit which will enlocde this new code with two more for loops.
    Several lines that were missing semi-colons have these added (they were
    at the end of nested blocks, so it wasn't an error)

M       ext/XS-APItest/t/utf8.t

commit 1b514755b0094d7e6aff514fe66e91dc20eb5068
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Oct 8 20:53:31 2016 -0600

    APItest/t/utf8.t: Fix improper tests
    
    These two tests are overlong malformations, besides being the ones
    purportedly being tested.  Make them not overlong, so are testing just
    one thing

M       ext/XS-APItest/t/utf8.t

commit ef652b2e6f462d1affda9b4fceb9f412473e128e
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Oct 5 18:34:15 2016 -0600

    APItest/t/utf8.t: Add missing test
    
    Under some circumstances we weren't validating that the generated
    warnings are correct.  This required reordering some 'if' tests, and
    revised special casing of the overflow test.

M       ext/XS-APItest/t/utf8.t

commit 96f5e3aa7b956a03b55043352ff2c96cadcebd61
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Oct 5 18:32:55 2016 -0600

    APItest/t/utf8.t: Rename test for clarity

M       ext/XS-APItest/t/utf8.t

commit 12a4bed3734514e1cda3ce30f0d08a5674d9b7e8
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Oct 2 21:50:10 2016 -0600

    utf8.c: Extract some code into 2 functions
    
    This is in preparation for the same functionality to each be used in a
    new place in a future commit

M       embed.fnc
M       embed.h
M       proto.h
M       utf8.c

commit 197945405cd006523680ded35dabfee10e233d41
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Oct 2 21:31:52 2016 -0600

    utf8.c: Rename a couple of macros for clarity
    
    These were recently added in 2b47960981adadbe81b9635d4ca7861c45ccdced.
    This also removes the #undefs of these in preparation for them to be
    used later in the file.

M       utf8.c

commit 6f89c5a0e6dc613d9b45f50b12f5ad9b69d7a6df
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Oct 2 21:09:27 2016 -0600

    utf8.h: Change some flag definition constants
    
    These #defines give flag bits in a U32.  This commit opens a gap that
    will be filled in a future commit.  A test file has to change to
    correspond, as it duplicates the defines.

M       ext/XS-APItest/t/utf8.t
M       utf8.h

commit d84e92aa0ca24c080daa950e82c6b14870f9e385
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Oct 2 21:05:15 2016 -0600

    APItest/t/utf8.t: Extract code to common function
    
    There are many instances of this simple code to dump an array of trapped
    warning messages.  The problem is that they display better when joined
    by "" rather than by a comma.  Rather than change each instance to do
    that, I changed each instance to a sub call and changed it there.

M       ext/XS-APItest/t/utf8.t

commit 889c88568fd0045a7d4aee82e769d3d130f0335c
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Sep 30 12:42:45 2016 -0600

    utf8.c: Add some UNLIKELY()s
    
    for branch prediction

M       utf8.c

commit 7cf8d05d1e856f3bd3a392b3ccea008f1c1eb743
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 28 15:05:17 2016 -0600

    Add details to UTF-8 malformation error messages
    
    I've long been unsatisfied with the information contained in the
    error/warning messages raised when some input is malformed UTF-8, but
    have been reluctant to change the text in case some one is relying on
    it.  One reason that someone might be parsing the messages is that there
    has been no convenient way to otherwise pin down what the exact
    malformation might be.  A few commits from now will add a facility
    to get the type of malformation unambiguously.  This will be a better
    mechanism to use for those rare modules that need to know what's the
    exact malformation.
    
    So, I will fix and issue pull requests for any module broken by this
    commit.
    
    The messages are changed by now dumping (in \xXY format) the bytes that
    make up the malformed character, and extra details are added in most
    cases.
    
    Messages about overlongs now display the code point they evaluate to and
    what the shortest UTF-8 sequence for generating that code point is.
    
    Messages about overflowing now just display that it overflows, since the
    entire byte sequence is now dumped.  The previous message displayed just
    the byte which was being processed where overflow was detected, but that
    information is not at all meaningfull.

M       embed.fnc
M       embed.h
M       ext/XS-APItest/t/utf8.t
M       lib/utf8.t
M       pod/perldelta.pod
M       pod/perldiag.pod
M       proto.h
M       t/io/utf8.t
M       t/lib/warnings/utf8
M       t/op/pack.t
M       t/op/utf8decode.t
M       utf8.c

commit 806547a7dc29226b6a06672e1d42fb136e766510
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 28 10:19:03 2016 -0600

    utf8.c: Consolidate duplicate error msg text
    
    This text is generated in 2 places; consolidate into one place.

M       embed.fnc
M       embed.h
M       proto.h
M       utf8.c
-----------------------------------------------------------------------

Summary of changes:
 embed.fnc                 |   18 +-
 embed.h                   |    6 +-
 ext/XS-APItest/APItest.xs |   18 +-
 ext/XS-APItest/t/utf8.t   | 1040 ++++++++++++++++++++++++---------------
 lib/utf8.t                |    2 +-
 pod/perldelta.pod         |   19 +-
 pod/perldiag.pod          |    9 +-
 proto.h                   |   25 +-
 t/io/utf8.t               |    2 +-
 t/lib/warnings/utf8       |    6 +-
 t/op/pack.t               |    2 +-
 t/op/utf8decode.t         |   52 +-
 utf8.c                    | 1191 +++++++++++++++++++++++++++++++--------------
 utf8.h                    |   45 +-
 14 files changed, 1624 insertions(+), 811 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index e4c4e30..46426b6 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1673,6 +1673,14 @@ ApdD     |UV     |to_utf8_case   |NN const U8 *p         
                        \
                                |NN const char *normal|                         
\
                                NULLOK const char *special
 #if defined(PERL_IN_UTF8_C)
+inRP   |bool   |does_utf8_overflow|NN const U8 * const s|NN const U8 * e
+inRP   |bool   |is_utf8_overlong_given_start_byte_ok|NN const U8 * const 
s|const STRLEN len
+sMR    |char * |unexpected_non_continuation_text                       \
+               |NN const U8 * const s                                  \
+               |const STRLEN print_len                                 \
+               |const STRLEN non_cont_byte_pos                         \
+               |const STRLEN expect_len
+sM     |char * |_byte_dump_string|NN const U8 * s|const STRLEN len
 s      |UV     |_to_utf8_case  |const UV uv1                                   
\
                                |NN const U8 *p                                 
\
                                |NN U8* ustrp                                   
\
@@ -1729,7 +1737,15 @@ Amd      |UV     |utf8_to_uvchr_buf      |NN const U8 
*s|NN const U8 *send|NULLOK STRLEN *retl
 ApdD   |UV     |utf8_to_uvuni_buf      |NN const U8 *s|NN const U8 
*send|NULLOK STRLEN *retlen
 pM     |bool   |check_utf8_print       |NN const U8 *s|const STRLEN len
 
-Adp    |UV     |utf8n_to_uvchr |NN const U8 *s|STRLEN curlen|NULLOK STRLEN 
*retlen|U32 flags
+Adop   |UV     |utf8n_to_uvchr |NN const U8 *s                             \
+                               |STRLEN curlen                              \
+                               |NULLOK STRLEN *retlen                      \
+                               |const U32 flags
+Adp    |UV     |utf8n_to_uvchr_error|NN const U8 *s                        \
+                               |STRLEN curlen                              \
+                               |NULLOK STRLEN *retlen                      \
+                               |const U32 flags                            \
+                               |NULLOK U32 * errors
 AipnR  |UV     |valid_utf8_to_uvchr    |NN const U8 *s|NULLOK STRLEN *retlen
 Ap     |UV     |utf8n_to_uvuni|NN const U8 *s|STRLEN curlen|NULLOK STRLEN 
*retlen|U32 flags
 
diff --git a/embed.h b/embed.h
index 31d0548..5df381c 100644
--- a/embed.h
+++ b/embed.h
@@ -737,7 +737,7 @@
 #define utf8_to_uvchr(a,b)     Perl_utf8_to_uvchr(aTHX_ a,b)
 #define utf8_to_uvuni(a,b)     Perl_utf8_to_uvuni(aTHX_ a,b)
 #define utf8_to_uvuni_buf(a,b,c)       Perl_utf8_to_uvuni_buf(aTHX_ a,b,c)
-#define utf8n_to_uvchr(a,b,c,d)        Perl_utf8n_to_uvchr(aTHX_ a,b,c,d)
+#define utf8n_to_uvchr_error(a,b,c,d,e)        Perl_utf8n_to_uvchr_error(aTHX_ 
a,b,c,d,e)
 #define utf8n_to_uvuni(a,b,c,d)        Perl_utf8n_to_uvuni(aTHX_ a,b,c,d)
 #define uvoffuni_to_utf8_flags(a,b,c)  Perl_uvoffuni_to_utf8_flags(aTHX_ a,b,c)
 #define uvuni_to_utf8(a,b)     Perl_uvuni_to_utf8(aTHX_ a,b)
@@ -1821,13 +1821,17 @@
 #define isa_lookup(a,b,c,d)    S_isa_lookup(aTHX_ a,b,c,d)
 #  endif
 #  if defined(PERL_IN_UTF8_C)
+#define _byte_dump_string(a,b) S__byte_dump_string(aTHX_ a,b)
 #define _to_utf8_case(a,b,c,d,e,f,g)   S__to_utf8_case(aTHX_ a,b,c,d,e,f,g)
 #define check_locale_boundary_crossing(a,b,c,d)        
S_check_locale_boundary_crossing(aTHX_ a,b,c,d)
+#define does_utf8_overflow     S_does_utf8_overflow
 #define is_utf8_common(a,b,c,d)        S_is_utf8_common(aTHX_ a,b,c,d)
 #define is_utf8_cp_above_31_bits       S_is_utf8_cp_above_31_bits
+#define is_utf8_overlong_given_start_byte_ok   
S_is_utf8_overlong_given_start_byte_ok
 #define swash_scan_list_line(a,b,c,d,e,f,g)    S_swash_scan_list_line(aTHX_ 
a,b,c,d,e,f,g)
 #define swatch_get(a,b,c)      S_swatch_get(aTHX_ a,b,c)
 #define to_lower_latin1                S_to_lower_latin1
+#define unexpected_non_continuation_text(a,b,c,d)      
S_unexpected_non_continuation_text(aTHX_ a,b,c,d)
 #  endif
 #  if defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C)
 #define _to_upper_title_latin1(a,b,c,d)        
Perl__to_upper_title_latin1(aTHX_ a,b,c,d)
diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs
index ce94968..6b6f45f 100644
--- a/ext/XS-APItest/APItest.xs
+++ b/ext/XS-APItest/APItest.xs
@@ -1362,7 +1362,7 @@ bytes_cmp_utf8(bytes, utf8)
        RETVAL
 
 AV *
-test_utf8n_to_uvchr(s, len, flags)
+test_utf8n_to_uvchr_error(s, len, flags)
 
         SV *s
         SV *len
@@ -1371,20 +1371,25 @@ test_utf8n_to_uvchr(s, len, flags)
         STRLEN retlen;
         UV ret;
         STRLEN slen;
+        U32 errors;
 
     CODE:
-        /* Call utf8n_to_uvchr() with the inputs.  It always asks for the
-         * actual length to be returned
+        /* Now that utf8n_to_uvchr() is a trivial wrapper for
+         * utf8n_to_uvchr_error(), call the latter with the inputs.  It always
+         * asks for the actual length to be returned and errors to be returned
          *
          * Length to assume <s> is; not checked, so could have buffer overflow
          */
         RETVAL = newAV();
         sv_2mortal((SV*)RETVAL);
 
-        ret
-         = utf8n_to_uvchr((U8*) SvPV(s, slen), SvUV(len), &retlen, 
SvUV(flags));
+        ret = utf8n_to_uvchr_error((U8*) SvPV(s, slen),
+                                         SvUV(len),
+                                         &retlen,
+                                         SvUV(flags),
+                                         &errors);
 
-        /* Returns the return value in [0]; <retlen> in [1] */
+        /* Returns the return value in [0]; <retlen> in [1], <errors> in [2] */
         av_push(RETVAL, newSVuv(ret));
         if (retlen == (STRLEN) -1) {
             av_push(RETVAL, newSViv(-1));
@@ -1392,6 +1397,7 @@ test_utf8n_to_uvchr(s, len, flags)
         else {
             av_push(RETVAL, newSVuv(retlen));
         }
+        av_push(RETVAL, newSVuv(errors));
 
     OUTPUT:
         RETVAL
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index 32f60e0..0f2d9ee 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -8,7 +8,7 @@ no warnings 'deprecated'; # Some of the below are above IV_MAX 
on 32 bit
                           # machines, and that is tested elsewhere
 
 use XS::APItest;
-
+use Data::Dumper;
 my $pound_sign = chr utf8::unicode_to_native(163);
 
 sub isASCII { ord "A" == 65 }
@@ -21,6 +21,10 @@ sub display_bytes {
            . '"';
 }
 
+sub output_warnings(@) {
+    diag "The warnings were:\n" . join("", @_);
+}
+
 # This  test file can't use byte_utf8a_to_utf8n() from t/charset_tools.pl
 # because that uses the same functions we are testing here.  So UTF-EBCDIC
 # strings are hard-coded as I8 strings in this file instead, and we use array
@@ -59,26 +63,56 @@ for (my $i = 0; $i < 256; $i++) {
                     ? sub { return shift }
                     : sub { return join "", map { chr $native_to_i8[ord $_] }
                                             split "", shift };
+sub start_byte_to_cont($) {
+
+    # Extract the code point information from the input UTF-8 start byte, and
+    # return a continuation byte containing the same information.  This is
+    # used in constructing an overlong malformation from valid input.
+
+    my $byte = shift;
+    my $len = test_UTF8_SKIP($byte);
+    if ($len < 2) {
+        die "";
+    }
+
+    $byte = ord native_to_I8($byte);
+
+    # Copied from utf8.h.  This gets rid of the leading 1 bits.
+    $byte &= ((($len) >= 7) ? 0x00 : (0x1F >> (($len)-2)));
+
+    $byte |= (isASCII) ? 0x80 : ord I8_to_native("\xA0");
+    return chr $byte;
+}
 
 my $is64bit = length sprintf("%x", ~0) > 8;
 
 
-# Test utf8n_to_uvchr().  These provide essentially complete code coverage.
-# Copied from utf8.h
+# Test utf8n_to_uvchr_error().  These provide essentially complete code
+# coverage.  Copied from utf8.h
 my $UTF8_ALLOW_EMPTY            = 0x0001;
+my $UTF8_GOT_EMPTY              = $UTF8_ALLOW_EMPTY;
 my $UTF8_ALLOW_CONTINUATION     = 0x0002;
+my $UTF8_GOT_CONTINUATION       = $UTF8_ALLOW_CONTINUATION;
 my $UTF8_ALLOW_NON_CONTINUATION = 0x0004;
+my $UTF8_GOT_NON_CONTINUATION   = $UTF8_ALLOW_NON_CONTINUATION;
 my $UTF8_ALLOW_SHORT            = 0x0008;
+my $UTF8_GOT_SHORT              = $UTF8_ALLOW_SHORT;
 my $UTF8_ALLOW_LONG             = 0x0010;
-my $UTF8_DISALLOW_SURROGATE     = 0x0020;
-my $UTF8_WARN_SURROGATE         = 0x0040;
-my $UTF8_DISALLOW_NONCHAR       = 0x0080;
-my $UTF8_WARN_NONCHAR           = 0x0100;
-my $UTF8_DISALLOW_SUPER         = 0x0200;
-my $UTF8_WARN_SUPER             = 0x0400;
-my $UTF8_DISALLOW_ABOVE_31_BIT  = 0x0800;
-my $UTF8_WARN_ABOVE_31_BIT      = 0x1000;
-my $UTF8_CHECK_ONLY             = 0x2000;
+my $UTF8_GOT_LONG               = $UTF8_ALLOW_LONG;
+my $UTF8_GOT_OVERFLOW           = 0x0020;
+my $UTF8_DISALLOW_SURROGATE     = 0x0040;
+my $UTF8_GOT_SURROGATE          = $UTF8_DISALLOW_SURROGATE;
+my $UTF8_WARN_SURROGATE         = 0x0080;
+my $UTF8_DISALLOW_NONCHAR       = 0x0100;
+my $UTF8_GOT_NONCHAR            = $UTF8_DISALLOW_NONCHAR;
+my $UTF8_WARN_NONCHAR           = 0x0200;
+my $UTF8_DISALLOW_SUPER         = 0x0400;
+my $UTF8_GOT_SUPER              = $UTF8_DISALLOW_SUPER;
+my $UTF8_WARN_SUPER             = 0x0800;
+my $UTF8_DISALLOW_ABOVE_31_BIT  = 0x1000;
+my $UTF8_GOT_ABOVE_31_BIT       = $UTF8_DISALLOW_ABOVE_31_BIT;
+my $UTF8_WARN_ABOVE_31_BIT      = 0x2000;
+my $UTF8_CHECK_ONLY             = 0x4000;
 my $UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE
                              = $UTF8_DISALLOW_SUPER|$UTF8_DISALLOW_SURROGATE;
 my $UTF8_DISALLOW_ILLEGAL_INTERCHANGE
@@ -445,7 +479,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
         unless (is(scalar @warnings, 0,
                 "   Verify is_utf8_valid_partial_char_flags generated no 
warnings"))
         {
-            diag "The warnings were: " . join(", ", @warnings);
+            output_warnings(@warnings);
         }
 
         my $b = substr($n_chr, $j, 1);
@@ -535,19 +569,23 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
 
     my $display_flags = sprintf "0x%x", $this_utf8_flags;
     my $display_bytes = display_bytes($bytes);
-    my $ret_ref = test_utf8n_to_uvchr($bytes, $len, $this_utf8_flags);
+    my $ret_ref = test_utf8n_to_uvchr_error($bytes, $len, $this_utf8_flags);
 
     # Rest of tests likely meaningless if it gets the wrong code point.
     next unless is($ret_ref->[0], $n,
-       "Verify utf8n_to_uvchr($display_bytes, $display_flags) returns $hex_n");
+                   "Verify utf8n_to_uvchr_error($display_bytes, 
$display_flags)"
+                 . "returns $hex_n");
     is($ret_ref->[1], $len,
-       "Verify utf8n_to_uvchr() for $hex_n returns expected length: $len");
+       "Verify utf8n_to_uvchr_error() for $hex_n returns expected length:"
+     . " $len");
 
     unless (is(scalar @warnings, 0,
-               "Verify utf8n_to_uvchr() for $hex_n generated no warnings"))
+             "Verify utf8n_to_uvchr_error() for $hex_n generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
+    is($ret_ref->[2], 0,
+       "Verify utf8n_to_uvchr_error() returned no error bits");
 
     undef @warnings;
 
@@ -557,7 +595,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isUTF8_CHAR() for $hex_n generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -568,7 +606,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isUTF8_CHAR() generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -579,7 +617,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isUTF8_CHAR_flags() for $hex_n generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -590,7 +628,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isUTF8_CHAR_flags() generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -602,7 +640,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isSTRICT_UTF8_CHAR() for $hex_n generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -613,7 +651,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isSTRICT_UTF8_CHAR() generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -624,7 +662,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isUTF8_CHAR() for $hex_n generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -636,7 +674,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isC9_STRICT_UTF8_CHAR() for $hex_n generated no 
warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -647,7 +685,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isC9_STRICT_UTF8_CHAR() generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -658,7 +696,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify isUTF8_CHAR() for $hex_n generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -670,7 +708,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
                "Verify valid_utf8_to_uvchr() for $hex_n generated no 
warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     # Similarly for uvchr_to_utf8
@@ -699,7 +737,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     unless (is(scalar @warnings, 0,
         "Verify uvchr_to_utf8_flags($hex_n, $display_flags) for $hex_n 
generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     # Now append this code point to a string that we will test various
@@ -975,139 +1013,139 @@ my $REPLACEMENT = 0xFFFD;
 # Now test the malformations.  All these raise category utf8 warnings.
 my @malformations = (
     [ "zero length string malformation", "", 0,
-        $UTF8_ALLOW_EMPTY, 0, 0,
+        $UTF8_ALLOW_EMPTY, $UTF8_GOT_EMPTY, 0, 0,
         qr/empty string/
     ],
     [ "orphan continuation byte malformation", I8_to_native("${I8c}a"),
         2,
-        $UTF8_ALLOW_CONTINUATION, $REPLACEMENT, 1,
+        $UTF8_ALLOW_CONTINUATION, $UTF8_GOT_CONTINUATION, $REPLACEMENT, 1,
         qr/unexpected continuation byte/
     ],
     [ "premature next character malformation (immediate)",
         (isASCII) ? "\xc2\xc2\x80" : I8_to_native("\xc5\xc5\xa0"),
         3,
-        $UTF8_ALLOW_NON_CONTINUATION, $REPLACEMENT, 1,
+        $UTF8_ALLOW_NON_CONTINUATION, $UTF8_GOT_NON_CONTINUATION, 
$REPLACEMENT, 1,
         qr/unexpected non-continuation byte.*immediately after start byte/
     ],
     [ "premature next character malformation (non-immediate)",
-        I8_to_native("\xf0${I8c}a"),
+        I8_to_native("\xf1${I8c}a"),
         3,
-        $UTF8_ALLOW_NON_CONTINUATION, $REPLACEMENT, 2,
+        $UTF8_ALLOW_NON_CONTINUATION, $UTF8_GOT_NON_CONTINUATION, 
$REPLACEMENT, 2,
         qr/unexpected non-continuation byte .* 2 bytes after start byte/
     ],
-    [ "too short malformation", I8_to_native("\xf0${I8c}a"), 2,
+    [ "too short malformation", I8_to_native("\xf1${I8c}a"), 2,
         # Having the 'a' after this, but saying there are only 2 bytes also
         # tests that we pay attention to the passed in length
-        $UTF8_ALLOW_SHORT, $REPLACEMENT, 2,
+        $UTF8_ALLOW_SHORT, $UTF8_GOT_SHORT, $REPLACEMENT, 2,
         qr/2 bytes, need 4/
     ],
     [ "overlong malformation, lowest 2-byte",
         (isASCII) ? "\xc0\x80" : I8_to_native("\xc0\xa0"),
         2,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         0,   # NUL
         2,
-        qr/2 bytes, need 1/
+        qr/overlong/
     ],
     [ "overlong malformation, highest 2-byte",
         (isASCII) ? "\xc1\xbf" : I8_to_native("\xc4\xbf"),
         2,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         (isASCII) ? 0x7F : utf8::unicode_to_native(0xBF),
         2,
-        qr/2 bytes, need 1/
+        qr/overlong/
     ],
     [ "overlong malformation, lowest 3-byte",
         (isASCII) ? "\xe0\x80\x80" : I8_to_native("\xe0\xa0\xa0"),
         3,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         0,   # NUL
         3,
-        qr/3 bytes, need 1/
+        qr/overlong/
     ],
     [ "overlong malformation, highest 3-byte",
         (isASCII) ? "\xe0\x9f\xbf" : I8_to_native("\xe0\xbf\xbf"),
         3,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         (isASCII) ? 0x7FF : 0x3FF,
         3,
-        qr/3 bytes, need 2/
+        qr/overlong/
     ],
     [ "overlong malformation, lowest 4-byte",
         (isASCII) ? "\xf0\x80\x80\x80" : I8_to_native("\xf0\xa0\xa0\xa0"),
         4,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         0,   # NUL
         4,
-        qr/4 bytes, need 1/
+        qr/overlong/
     ],
     [ "overlong malformation, highest 4-byte",
         (isASCII) ? "\xf0\x8F\xbf\xbf" : I8_to_native("\xf0\xaf\xbf\xbf"),
         4,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         (isASCII) ? 0xFFFF : 0x3FFF,
         4,
-        qr/4 bytes, need 3/
+        qr/overlong/
     ],
     [ "overlong malformation, lowest 5-byte",
         (isASCII)
          ?              "\xf8\x80\x80\x80\x80"
          : I8_to_native("\xf8\xa0\xa0\xa0\xa0"),
         5,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         0,   # NUL
         5,
-        qr/5 bytes, need 1/
+        qr/overlong/
     ],
     [ "overlong malformation, highest 5-byte",
         (isASCII)
          ?              "\xf8\x87\xbf\xbf\xbf"
          : I8_to_native("\xf8\xa7\xbf\xbf\xbf"),
         5,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         (isASCII) ? 0x1FFFFF : 0x3FFFF,
         5,
-        qr/5 bytes, need 4/
+        qr/overlong/
     ],
     [ "overlong malformation, lowest 6-byte",
         (isASCII)
          ?              "\xfc\x80\x80\x80\x80\x80"
          : I8_to_native("\xfc\xa0\xa0\xa0\xa0\xa0"),
         6,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         0,   # NUL
         6,
-        qr/6 bytes, need 1/
+        qr/overlong/
     ],
     [ "overlong malformation, highest 6-byte",
         (isASCII)
          ?              "\xfc\x83\xbf\xbf\xbf\xbf"
          : I8_to_native("\xfc\xa3\xbf\xbf\xbf\xbf"),
         6,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         (isASCII) ? 0x3FFFFFF : 0x3FFFFF,
         6,
-        qr/6 bytes, need 5/
+        qr/overlong/
     ],
     [ "overlong malformation, lowest 7-byte",
         (isASCII)
          ?              "\xfe\x80\x80\x80\x80\x80\x80"
          : I8_to_native("\xfe\xa0\xa0\xa0\xa0\xa0\xa0"),
         7,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         0,   # NUL
         7,
-        qr/7 bytes, need 1/
+        qr/overlong/
     ],
     [ "overlong malformation, highest 7-byte",
         (isASCII)
          ?              "\xfe\x81\xbf\xbf\xbf\xbf\xbf"
          : I8_to_native("\xfe\xa1\xbf\xbf\xbf\xbf\xbf"),
         7,
-        $UTF8_ALLOW_LONG,
+        $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
         (isASCII) ? 0x7FFFFFFF : 0x3FFFFFF,
         7,
-        qr/7 bytes, need 6/
+        qr/overlong/
     ],
 );
 
@@ -1118,17 +1156,19 @@ if (isASCII && ! $is64bit) {    # 32-bit ASCII platform
             "\xfe\x84\x80\x80\x80\x80\x80",  # Represents 2**32
             7,
             0,  # There is no way to allow this malformation
+            $UTF8_GOT_OVERFLOW,
             $REPLACEMENT,
             7,
-            qr/overflow/
+            qr/overflows/
         ],
         [ "overflow malformation, can tell on first byte",
             "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80",
             13,
             0,  # There is no way to allow this malformation
+            $UTF8_GOT_OVERFLOW,
             $REPLACEMENT,
             13,
-            qr/overflow/
+            qr/overflows/
         ];
 }
 else {
@@ -1143,20 +1183,20 @@ else {
              ?              
"\xff\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80"
              : 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
             (isASCII) ? 13 : 14,
-            $UTF8_ALLOW_LONG,
+            $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
             0,   # NUL
             (isASCII) ? 13 : 14,
-            qr/1[34] bytes, need 1/,    # 1[34] to work on either ASCII or 
EBCDIC
+            qr/overlong/,
         ],
         [ "overlong malformation, highest max-byte",
             (isASCII)    # 2**36-1 on ASCII; 2**30-1 on EBCDIC
              ?              
"\xff\x80\x80\x80\x80\x80\x80\xbf\xbf\xbf\xbf\xbf\xbf"
              : 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xbf\xbf\xbf\xbf\xbf\xbf"),
             (isASCII) ? 13 : 14,
-            $UTF8_ALLOW_LONG,
+            $UTF8_ALLOW_LONG, $UTF8_GOT_LONG,
             (isASCII) ? 0xFFFFFFFFF : 0x3FFFFFFF,
             (isASCII) ? 13 : 14,
-            qr/1[34] bytes, need 7/,
+            qr/overlong/,
         ];
 
     if (! $is64bit) {   # 32-bit EBCDIC
@@ -1165,9 +1205,10 @@ else {
             
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa4\xa0\xa0\xa0\xa0\xa0\xa0"),
             14,
             0,  # There is no way to allow this malformation
+            $UTF8_GOT_OVERFLOW,
             $REPLACEMENT,
             14,
-            qr/overflow/
+            qr/overflows/
         ];
     }
     else {  # 64-bit
@@ -1178,17 +1219,25 @@ else {
                 : 
I8_to_native("\xff\xb0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
                 (isASCII) ? 13 : 14,
                 0,  # There is no way to allow this malformation
+                $UTF8_GOT_OVERFLOW,
                 $REPLACEMENT,
                 (isASCII) ? 13 : 14,
-                qr/overflow/
+                qr/overflows/
             ];
     }
 }
 
 foreach my $test (@malformations) {
-    my ($testname, $bytes, $length, $allow_flags, $allowed_uv, $expected_len, 
$message ) = @$test;
-
-    next if ! ok(length($bytes) >= $length, "$testname: Make sure won't read 
beyond buffer: " . length($bytes) . " >= $length");
+    my ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
+        $allowed_uv, $expected_len, $message ) = @$test;
+
+    if (length($bytes) < $length) {
+        fail("Internal test error: actual buffer length (" . length($bytes)
+           . ") must be at least as high as how far we are allowed to read"
+           . " into it ($length)");
+        diag($testname);
+        next;
+    }
 
     undef @warnings;
 
@@ -1197,7 +1246,7 @@ foreach my $test (@malformations) {
     unless (is(scalar @warnings, 0,
                "$testname: isUTF8_CHAR() generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     undef @warnings;
@@ -1207,7 +1256,7 @@ foreach my $test (@malformations) {
     unless (is(scalar @warnings, 0,
                "$testname: isUTF8_CHAR() generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     $ret = test_isSTRICT_UTF8_CHAR($bytes, $length);
@@ -1215,7 +1264,7 @@ foreach my $test (@malformations) {
     unless (is(scalar @warnings, 0,
                "$testname: isSTRICT_UTF8_CHAR() generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     $ret = test_isC9_STRICT_UTF8_CHAR($bytes, $length);
@@ -1223,7 +1272,7 @@ foreach my $test (@malformations) {
     unless (is(scalar @warnings, 0,
                "$testname: isC9_STRICT_UTF8_CHAR() generated no warnings"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
 
     for my $j (1 .. $length - 1) {
@@ -1234,9 +1283,15 @@ foreach my $test (@malformations) {
         $ret = test_is_utf8_valid_partial_char_flags($bytes, $j, 0);
         my $ret_should_be = 0;
         my $comment = "";
-        if ($testname =~ /premature|short/ && $j < 2) {
-            $ret_should_be = 1;
-            $comment = ", but need 2 bytes to discern:";
+        if ($testname =~ /premature|short/ && $j < 3) {
+
+            # The tests are hard-coded so these relationships hold
+            my $cut_off = 2;
+            $cut_off = 3 if $testname =~ /non-immediate/;
+            if ($j < $cut_off) {
+                $ret_should_be = 1;
+                $comment = ", but need $cut_off bytes to discern:";
+            }
         }
         elsif ($testname =~ /overlong/ && ! isASCII && $length == 3) {
             # 3-byte overlongs on EBCDIC are determinable on the first byte
@@ -1271,56 +1326,103 @@ foreach my $test (@malformations) {
         unless (is(scalar @warnings, 0,
                 "$testname: is_utf8_valid_partial_char_flags() generated no 
warnings"))
         {
-            diag "The warnings were: " . join(", ", @warnings);
+            output_warnings(@warnings);
         }
     }
 
 
     # Test what happens when this malformation is not allowed
     undef @warnings;
-    my $ret_ref = test_utf8n_to_uvchr($bytes, $length, 0);
+    my $ret_ref = test_utf8n_to_uvchr_error($bytes, $length, 0);
     is($ret_ref->[0], 0, "$testname: disallowed: Returns 0");
-    is($ret_ref->[1], $expected_len, "$testname: utf8n_to_uvchr(), disallowed: 
Returns expected length: $expected_len");
-    if (is(scalar @warnings, 1, "$testname: disallowed: Got a single warning 
")) {
-        like($warnings[0], $message, "$testname: disallowed: Got expected 
warning");
+    is($ret_ref->[1], $expected_len,
+       "$testname: utf8n_to_uvchr_error(), disallowed: Returns expected"
+     . " length: $expected_len");
+    if (is(scalar @warnings, 1,
+           "$testname: disallowed: Got a single warning "))
+    {
+        like($warnings[0], $message,
+             "$testname: disallowed: Got expected warning");
     }
     else {
         if (scalar @warnings) {
-            diag "The warnings were: " . join(", ", @warnings);
+            output_warnings(@warnings);
         }
     }
+    is($ret_ref->[2], $expected_error_flags,
+       "$testname: utf8n_to_uvchr_error(), disallowed:"
+     . " Returns expected error");
 
     {   # Next test when disallowed, and warnings are off.
         undef @warnings;
         no warnings 'utf8';
-        my $ret_ref = test_utf8n_to_uvchr($bytes, $length, 0);
-        is($ret_ref->[0], 0, "$testname: utf8n_to_uvchr(), disallowed: no 
warnings 'utf8': Returns 0");
-        is($ret_ref->[1], $expected_len, "$testname: utf8n_to_uvchr(), 
disallowed: no warnings 'utf8': Returns expected length: $expected_len");
-        if (!is(scalar @warnings, 0, "$testname: utf8n_to_uvchr(), disallowed: 
no warnings 'utf8': no warnings generated")) {
-            diag "The warnings were: " . join(", ", @warnings);
+        my $ret_ref = test_utf8n_to_uvchr_error($bytes, $length, 0);
+        is($ret_ref->[0], 0,
+           "$testname: utf8n_to_uvchr_error(), disallowed: no warnings 'utf8':"
+         . " Returns 0");
+        is($ret_ref->[1], $expected_len,
+           "$testname: utf8n_to_uvchr_error(), disallowed: no warnings 'utf8':"
+         . " Returns expected length: $expected_len");
+        if (!is(scalar @warnings, 0,
+            "$testname: utf8n_to_uvchr_error(), disallowed: no warnings 
'utf8':"
+          . " no warnings generated"))
+        {
+            output_warnings(@warnings);
         }
+        is($ret_ref->[2], $expected_error_flags,
+           "$testname: utf8n_to_uvchr_error(), disallowed: Returns"
+         . " expected error");
     }
 
     # Test with CHECK_ONLY
     undef @warnings;
-    $ret_ref = test_utf8n_to_uvchr($bytes, $length, $UTF8_CHECK_ONLY);
+    $ret_ref = test_utf8n_to_uvchr_error($bytes, $length, $UTF8_CHECK_ONLY);
     is($ret_ref->[0], 0, "$testname: CHECK_ONLY: Returns 0");
     is($ret_ref->[1], -1, "$testname: CHECK_ONLY: returns -1 for length");
     if (! is(scalar @warnings, 0, "$testname: CHECK_ONLY: no warnings 
generated")) {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
+    is($ret_ref->[2], $expected_error_flags,
+       "$testname: utf8n_to_uvchr_error(), disallowed: Returns expected"
+     . " error");
 
     next if $allow_flags == 0;    # Skip if can't allow this malformation
 
     # Test when the malformation is allowed
     undef @warnings;
-    $ret_ref = test_utf8n_to_uvchr($bytes, $length, $allow_flags);
-    is($ret_ref->[0], $allowed_uv, "$testname: utf8n_to_uvchr(), allowed: 
Returns expected uv: " . sprintf("0x%04X", $allowed_uv));
-    is($ret_ref->[1], $expected_len, "$testname: utf8n_to_uvchr(), allowed: 
Returns expected length: $expected_len");
-    if (!is(scalar @warnings, 0, "$testname: utf8n_to_uvchr(), allowed: no 
warnings generated"))
+    $ret_ref = test_utf8n_to_uvchr_error($bytes, $length, $allow_flags);
+    is($ret_ref->[0], $allowed_uv,
+       "$testname: utf8n_to_uvchr_error(), allowed: Returns expected uv: "
+     . sprintf("0x%04X", $allowed_uv));
+    is($ret_ref->[1], $expected_len,
+       "$testname: utf8n_to_uvchr_error(), allowed: Returns expected length:"
+     . " $expected_len");
+    if (!is(scalar @warnings, 0,
+            "$testname: utf8n_to_uvchr_error(), allowed: no warnings"
+          . " generated"))
     {
-        diag "The warnings were: " . join(", ", @warnings);
+        output_warnings(@warnings);
     }
+    is($ret_ref->[2], $expected_error_flags,
+       "$testname: utf8n_to_uvchr_error(), disallowed: Returns"
+     . " expected error");
+}
+
+sub nonportable_regex ($) {
+
+    # Returns a pattern that matches the non-portable message raised either
+    # for the specific input code point, or the one generated when there
+    # is some malformation that precludes the message containing the specific
+    # code point
+
+    my $code_point = shift;
+
+    my $string = sprintf '(Code point 0x%x is not Unicode, and'
+                       . '|Any UTF-8 sequence that starts with'
+                       . ' "(\\\x[[:xdigit:]]{2})+" is for a'
+                       . ' non-Unicode code point, and is) not portable',
+                    $code_point;
+    return qr/$string/;
 }
 
 # Now test the cases where a legal code point is generated, but may or may not
@@ -1328,289 +1430,289 @@ foreach my $test (@malformations) {
 my @tests = (
     [ "lowest surrogate",
         (isASCII) ? "\xed\xa0\x80" : I8_to_native("\xf1\xb6\xa0\xa0"),
-        $UTF8_WARN_SURROGATE, $UTF8_DISALLOW_SURROGATE,
+        $UTF8_WARN_SURROGATE, $UTF8_DISALLOW_SURROGATE, $UTF8_GOT_SURROGATE,
         'surrogate', 0xD800,
         (isASCII) ? 3 : 4,
         qr/surrogate/
     ],
     [ "a middle surrogate",
         (isASCII) ? "\xed\xa4\x8d" : I8_to_native("\xf1\xb6\xa8\xad"),
-        $UTF8_WARN_SURROGATE, $UTF8_DISALLOW_SURROGATE,
+        $UTF8_WARN_SURROGATE, $UTF8_DISALLOW_SURROGATE, $UTF8_GOT_SURROGATE,
         'surrogate', 0xD90D,
         (isASCII) ? 3 : 4,
         qr/surrogate/
     ],
     [ "highest surrogate",
         (isASCII) ? "\xed\xbf\xbf" : I8_to_native("\xf1\xb7\xbf\xbf"),
-        $UTF8_WARN_SURROGATE, $UTF8_DISALLOW_SURROGATE,
+        $UTF8_WARN_SURROGATE, $UTF8_DISALLOW_SURROGATE, $UTF8_GOT_SURROGATE,
         'surrogate', 0xDFFF,
         (isASCII) ? 3 : 4,
         qr/surrogate/
     ],
     [ "first non_unicode",
         (isASCII) ? "\xf4\x90\x80\x80" : I8_to_native("\xf9\xa2\xa0\xa0\xa0"),
-        $UTF8_WARN_SUPER, $UTF8_DISALLOW_SUPER,
+        $UTF8_WARN_SUPER, $UTF8_DISALLOW_SUPER, $UTF8_GOT_SUPER,
         'non_unicode', 0x110000,
         (isASCII) ? 4 : 5,
-        qr/not Unicode.* may not be portable/
+        qr/(not Unicode|for a non-Unicode code point).* may not be portable/
     ],
     [ "non_unicode whose first byte tells that",
         (isASCII) ? "\xf5\x80\x80\x80" : I8_to_native("\xfa\xa0\xa0\xa0\xa0"),
-        $UTF8_WARN_SUPER, $UTF8_DISALLOW_SUPER,
+        $UTF8_WARN_SUPER, $UTF8_DISALLOW_SUPER, $UTF8_GOT_SUPER,
         'non_unicode',
         (isASCII) ? 0x140000 : 0x200000,
         (isASCII) ? 4 : 5,
-        qr/not Unicode.* may not be portable/
+        qr/(not Unicode|for a non-Unicode code point).* may not be portable/
     ],
     [ "first of 32 consecutive non-character code points",
         (isASCII) ? "\xef\xb7\x90" : I8_to_native("\xf1\xbf\xae\xb0"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xFDD0,
         (isASCII) ? 3 : 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "a mid non-character code point of the 32 consecutive ones",
         (isASCII) ? "\xef\xb7\xa0" : I8_to_native("\xf1\xbf\xaf\xa0"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xFDE0,
         (isASCII) ? 3 : 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "final of 32 consecutive non-character code points",
         (isASCII) ? "\xef\xb7\xaf" : I8_to_native("\xf1\xbf\xaf\xaf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xFDEF,
         (isASCII) ? 3 : 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+FFFE",
         (isASCII) ? "\xef\xbf\xbe" : I8_to_native("\xf1\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xFFFE,
         (isASCII) ? 3 : 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+FFFF",
         (isASCII) ? "\xef\xbf\xbf" : I8_to_native("\xf1\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xFFFF,
         (isASCII) ? 3 : 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+1FFFE",
         (isASCII) ? "\xf0\x9f\xbf\xbe" : I8_to_native("\xf3\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x1FFFE, 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+1FFFF",
         (isASCII) ? "\xf0\x9f\xbf\xbf" : I8_to_native("\xf3\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x1FFFF, 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+2FFFE",
         (isASCII) ? "\xf0\xaf\xbf\xbe" : I8_to_native("\xf5\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x2FFFE, 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+2FFFF",
         (isASCII) ? "\xf0\xaf\xbf\xbf" : I8_to_native("\xf5\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x2FFFF, 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+3FFFE",
         (isASCII) ? "\xf0\xbf\xbf\xbe" : I8_to_native("\xf7\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x3FFFE, 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+3FFFF",
         (isASCII) ? "\xf0\xbf\xbf\xbf" : I8_to_native("\xf7\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x3FFFF, 4,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+4FFFE",
         (isASCII) ? "\xf1\x8f\xbf\xbe" : I8_to_native("\xf8\xa9\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x4FFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+4FFFF",
         (isASCII) ? "\xf1\x8f\xbf\xbf" : I8_to_native("\xf8\xa9\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x4FFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+5FFFE",
         (isASCII) ? "\xf1\x9f\xbf\xbe" : I8_to_native("\xf8\xab\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x5FFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+5FFFF",
         (isASCII) ? "\xf1\x9f\xbf\xbf" : I8_to_native("\xf8\xab\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x5FFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+6FFFE",
         (isASCII) ? "\xf1\xaf\xbf\xbe" : I8_to_native("\xf8\xad\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x6FFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+6FFFF",
         (isASCII) ? "\xf1\xaf\xbf\xbf" : I8_to_native("\xf8\xad\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x6FFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+7FFFE",
         (isASCII) ? "\xf1\xbf\xbf\xbe" : I8_to_native("\xf8\xaf\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x7FFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+7FFFF",
         (isASCII) ? "\xf1\xbf\xbf\xbf" : I8_to_native("\xf8\xaf\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x7FFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+8FFFE",
         (isASCII) ? "\xf2\x8f\xbf\xbe" : I8_to_native("\xf8\xb1\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x8FFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+8FFFF",
         (isASCII) ? "\xf2\x8f\xbf\xbf" : I8_to_native("\xf8\xb1\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x8FFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+9FFFE",
         (isASCII) ? "\xf2\x9f\xbf\xbe" : I8_to_native("\xf8\xb3\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x9FFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+9FFFF",
         (isASCII) ? "\xf2\x9f\xbf\xbf" : I8_to_native("\xf8\xb3\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x9FFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+AFFFE",
         (isASCII) ? "\xf2\xaf\xbf\xbe" : I8_to_native("\xf8\xb5\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xAFFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+AFFFF",
         (isASCII) ? "\xf2\xaf\xbf\xbf" : I8_to_native("\xf8\xb5\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xAFFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+BFFFE",
         (isASCII) ? "\xf2\xbf\xbf\xbe" : I8_to_native("\xf8\xb7\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xBFFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+BFFFF",
         (isASCII) ? "\xf2\xbf\xbf\xbf" : I8_to_native("\xf8\xb7\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xBFFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+CFFFE",
         (isASCII) ? "\xf3\x8f\xbf\xbe" : I8_to_native("\xf8\xb9\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xCFFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+CFFFF",
         (isASCII) ? "\xf3\x8f\xbf\xbf" : I8_to_native("\xf8\xb9\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xCFFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+DFFFE",
         (isASCII) ? "\xf3\x9f\xbf\xbe" : I8_to_native("\xf8\xbb\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xDFFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+DFFFF",
         (isASCII) ? "\xf3\x9f\xbf\xbf" : I8_to_native("\xf8\xbb\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xDFFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+EFFFE",
         (isASCII) ? "\xf3\xaf\xbf\xbe" : I8_to_native("\xf8\xbd\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xEFFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+EFFFF",
         (isASCII) ? "\xf3\xaf\xbf\xbf" : I8_to_native("\xf8\xbd\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xEFFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+FFFFE",
         (isASCII) ? "\xf3\xbf\xbf\xbe" : I8_to_native("\xf8\xbf\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xFFFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+FFFFF",
         (isASCII) ? "\xf3\xbf\xbf\xbf" : I8_to_native("\xf8\xbf\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0xFFFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+10FFFE",
         (isASCII) ? "\xf4\x8f\xbf\xbe" : I8_to_native("\xf9\xa1\xbf\xbf\xbe"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x10FFFE,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
     ],
     [ "non-character code point U+10FFFF",
         (isASCII) ? "\xf4\x8f\xbf\xbf" : I8_to_native("\xf9\xa1\xbf\xbf\xbf"),
-        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
+        $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, $UTF8_GOT_NONCHAR,
         'nonchar', 0x10FFFF,
         (isASCII) ? 4 : 5,
         qr/Unicode non-character.*is not recommended for open interchange/
@@ -1622,16 +1724,17 @@ my @tests = (
         # This code point is chosen so that it is representable in a UV on
         # 32-bit machines
         $UTF8_WARN_ABOVE_31_BIT, $UTF8_DISALLOW_ABOVE_31_BIT,
+        $UTF8_GOT_ABOVE_31_BIT,
         'utf8', 0x80000000, (isASCII) ? 7 :14,
-        qr/Code point 0x80000000 is not Unicode, and not portable/
+        nonportable_regex(0x80000000)
     ],
     [ "requires at least 32 bits, and use SUPER-type flags, instead of 
ABOVE_31_BIT",
         (isASCII)
          ? "\xfe\x82\x80\x80\x80\x80\x80"
          : 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0"),
-        $UTF8_WARN_SUPER, $UTF8_DISALLOW_SUPER,
+        $UTF8_WARN_SUPER, $UTF8_DISALLOW_SUPER, $UTF8_GOT_SUPER,
         'utf8', 0x80000000, (isASCII) ? 7 :14,
-        qr/Code point 0x80000000 is not Unicode, and not portable/
+        nonportable_regex(0x80000000)
     ],
     [ "overflow with warnings/disallow for more than 31 bits",
         # This tests the interaction of WARN_ABOVE_31_BIT/DISALLOW_ABOVE_31_BIT
@@ -1649,14 +1752,12 @@ my @tests = (
         : ((isASCII)
            ?              "\xfe\x86\x80\x80\x80\x80\x80"
            : 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa4\xa0\xa0\xa0\xa0\xa0\xa0"))),
-
-        # We include both warning categories to make sure the ABOVE_31_BIT one
-        # has precedence
-        "$UTF8_WARN_ABOVE_31_BIT|$UTF8_WARN_SUPER",
-        "$UTF8_DISALLOW_ABOVE_31_BIT",
+        $UTF8_WARN_ABOVE_31_BIT,
+        $UTF8_DISALLOW_ABOVE_31_BIT,
+        $UTF8_GOT_ABOVE_31_BIT,
         'utf8', 0,
         (! isASCII) ? 14 : ($is64bit) ? 13 : 7,
-        qr/overflow at byte .*, after start byte 0xf/
+        qr/overflows/
     ],
 );
 
@@ -1668,55 +1769,64 @@ if ($is64bit) {
             ?              
"\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80"
             : 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
             $UTF8_WARN_ABOVE_31_BIT, $UTF8_DISALLOW_ABOVE_31_BIT,
+            $UTF8_GOT_ABOVE_31_BIT,
             'utf8', 0x1000000000, (isASCII) ? 13 : 14,
-            qr/Code point 0x.* is not Unicode, and not portable/
+            qr/and( is)? not portable/
         ];
     if (! isASCII) {
         push @tests,   # These could falsely show wrongly in a naive 
implementation
             [ "requires at least 32 bits",
                 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
                 $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                $UTF8_GOT_ABOVE_31_BIT,
                 'utf8', 0x800000000, 14,
-                qr/Code point 0x800000000 is not Unicode, and not portable/
+                nonportable_regex(0x80000000)
             ],
             [ "requires at least 32 bits",
                 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
                 $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                $UTF8_GOT_ABOVE_31_BIT,
                 'utf8', 0x10000000000, 14,
-                qr/Code point 0x10000000000 is not Unicode, and not portable/
+                nonportable_regex(0x10000000000)
             ],
             [ "requires at least 32 bits",
                 
I8_to_native("\xff\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
                 $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                $UTF8_GOT_ABOVE_31_BIT,
                 'utf8', 0x200000000000, 14,
-                qr/Code point 0x200000000000 is not Unicode, and not portable/
+                nonportable_regex(0x20000000000)
             ],
             [ "requires at least 32 bits",
                 
I8_to_native("\xff\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
                 $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                $UTF8_GOT_ABOVE_31_BIT,
                 'utf8', 0x4000000000000, 14,
-                qr/Code point 0x4000000000000 is not Unicode, and not portable/
+                nonportable_regex(0x4000000000000)
             ],
             [ "requires at least 32 bits",
                 
I8_to_native("\xff\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
                 $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                $UTF8_GOT_ABOVE_31_BIT,
                 'utf8', 0x80000000000000, 14,
-                qr/Code point 0x80000000000000 is not Unicode, and not 
portable/
+                nonportable_regex(0x80000000000000)
             ],
             [ "requires at least 32 bits",
                 
I8_to_native("\xff\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                   #IBM-1047  
\xFE\x41\x41\x41\x41\x41\x41\x43\x41\x41\x41\x41\x41\x41
                 $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                $UTF8_GOT_ABOVE_31_BIT,
                 'utf8', 0x1000000000000000, 14,
-                qr/Code point 0x1000000000000000 is not Unicode, and not 
portable/
+                nonportable_regex(0x1000000000000000)
             ];
     }
 }
 
 foreach my $test (@tests) {
-    my ($testname, $bytes, $warn_flags, $disallow_flags, $category, 
$allowed_uv, $expected_len, $message ) = @$test;
+    my ($testname, $bytes, $warn_flags, $disallow_flags, $expected_error_flags,
+        $category, $allowed_uv, $expected_len, $message ) = @$test;
 
     my $length = length $bytes;
-    my $will_overflow = $testname =~ /overflow/;
+    my $will_overflow = $testname =~ /overflow/ ? 'overflow' : "";
 
     {
         use warnings;
@@ -1736,7 +1846,7 @@ foreach my $test (@tests) {
         unless (is(scalar @warnings, 0,
                 "isUTF8_CHAR() and isUTF8_CHAR()_flags $testname: generated no 
warnings"))
         {
-            diag "The warnings were: " . join(", ", @warnings);
+            output_warnings(@warnings);
         }
 
         undef @warnings;
@@ -1759,7 +1869,7 @@ foreach my $test (@tests) {
         unless (is(scalar @warnings, 0,
                 "isSTRICT_UTF8_CHAR() and isUTF8_CHAR_flags $testname: 
generated no warnings"))
         {
-            diag "The warnings were: " . join(", ", @warnings);
+            output_warnings(@warnings);
         }
 
         undef @warnings;
@@ -1782,7 +1892,7 @@ foreach my $test (@tests) {
         unless (is(scalar @warnings, 0,
                 "isC9_STRICT_UTF8_CHAR() and isUTF8_CHAR_flags $testname: 
generated no warnings"))
         {
-            diag "The warnings were: " . join(", ", @warnings);
+            output_warnings(@warnings);
         }
 
         # Test partial character handling, for each byte not a full character
@@ -1845,7 +1955,7 @@ foreach my $test (@tests) {
                 unless (is(scalar @warnings, 0,
                         "$testname: is_utf8_valid_partial_char_flags() 
generated no warnings"))
                 {
-                    diag "The warnings were: " . join(", ", @warnings);
+                    output_warnings(@warnings);
                 }
             }
         }
@@ -1860,248 +1970,402 @@ foreach my $test (@tests) {
             foreach my $disallow_flag (0, $disallow_flags) {
                 foreach my $do_warning (0, 1) {
 
-                    my $eval_warn = $do_warning
-                                  ? "use warnings '$warning'"
-                                  : $warning eq "utf8"
-                                    ? "no warnings 'utf8'"
-                                    : "use warnings 'utf8'; no warnings 
'$warning'";
-
-                    # is effectively disallowed if will overflow, even if the
-                    # flag indicates it is allowed, fix up test name to
-                    # indicate this as well
-                    my $disallowed = $disallow_flag || $will_overflow;
-
-                    my $this_name = "utf8n_to_uvchr() $testname: " . 
(($disallow_flag)
-                                                    ? 'disallowed'
-                                                    : ($disallowed)
-                                                        ? 'ABOVE_31_BIT 
allowed'
-                                                        : 'allowed');
-                    $this_name .= ", $eval_warn";
-                    $this_name .= ", " . (($warn_flag)
-                                          ? 'with warning flag'
-                                          : 'no warning flag');
-
-                    undef @warnings;
-                    my $ret_ref;
-                    my $display_bytes = display_bytes($bytes);
-                    my $call = "Call was: $eval_warn; \$ret_ref = 
test_utf8n_to_uvchr('$display_bytes', $length, $warn_flag|$disallow_flag)";
-                    my $eval_text =      "$eval_warn; \$ret_ref = 
test_utf8n_to_uvchr('$bytes', $length, $warn_flag|$disallow_flag)";
-                    eval "$eval_text";
-                    if (! ok ("$@ eq ''", "$this_name: eval succeeded")) {
-                        diag "\$!='$!'; eval'd=\"$call\"";
-                        next;
-                    }
-                    if ($disallowed) {
-                        unless (is($ret_ref->[0], 0, "$this_name: Returns 0"))
-                        {
-                            diag $call;
-                        }
-                    }
-                    else {
-                        unless (is($ret_ref->[0], $allowed_uv,
-                                   "$this_name: Returns expected uv: "
-                                 . sprintf("0x%04X", $allowed_uv)))
-                        {
-                            diag $call;
-                        }
-                    }
-                    unless (is($ret_ref->[1], $expected_len,
-                        "$this_name: Returns expected length: $expected_len"))
+                    # We try each of the above with various combinations of
+                    # malformations that can occur on the same input sequence.
+                    foreach my $short ("",
+                                       "short",
+                                       "unexpected non-continuation")
                     {
-                        diag $call;
-                    }
+                        # The non-characters can't be discerned with a short
+                        # malformation
+                        next if $short && $testname =~ /non-character/;
+
+                        foreach my $overlong ("", "overlong") {
+
+                            # Our hard-coded overlong starts with \xFE, so
+                            # can't handle anything larger.
+                            next if $overlong
+                            && ord native_to_I8(substr($bytes, 0, 1)) >= 0xFE;
+
+                            my @malformations;
+                            my @expected_errors;
+                            push @malformations, $short if $short;
+                            push @malformations, $overlong if $overlong;
+
+                            # The overflow malformation test in the input
+                            # array is coerced into being treated like one of
+                            # the others.
+                            if ($will_overflow) {
+                                push @malformations, 'overflow';
+                                push @expected_errors, $UTF8_GOT_OVERFLOW;
+                            }
 
-                    if (! $do_warning
-                        && ($warning eq 'utf8' || $warning eq $category))
-                    {
-                        if (!is(scalar @warnings, 0,
-                                            "$this_name: No warnings 
generated"))
-                        {
-                            diag $call;
-                            diag "The warnings were: " . join(", ", @warnings);
-                        }
-                    }
-                    elsif ($will_overflow
-                           && ! $disallow_flag
-                           && $warning eq 'utf8')
-                    {
+                            my $malformations_name = join "/", @malformations;
+                            $malformations_name .= " malformation"
+                                                        if $malformations_name;
+                            $malformations_name .= "s" if @malformations > 1;
+                            my $this_bytes = $bytes;
+                            my $this_length = $length;
+                            my $expected_uv = $allowed_uv;
+                            my $this_expected_len = $expected_len;
+                            if ($malformations_name) {
+                                $expected_uv = 0;
+
+                                # Coerce the input into the desired
+                                # malformation
+                                if ($malformations_name =~ /overlong/) {
+
+                                    # For an overlong, we convert the original
+                                    # start byte into a continuation byte with
+                                    # the same data bits as originally. ...
+                                    substr($this_bytes, 0, 1)
+                                        = 
start_byte_to_cont(substr($this_bytes,
+                                                                    0, 1));
+
+                                    # ... Then we prepend it with a known
+                                    # overlong sequence.  This should evaluate
+                                    # to the exact same code point as the
+                                    # original.
+                                    $this_bytes = "\xfe"
+                                               . ("\x80"
+                                                   x ( 6 - 
length($this_bytes)))
+                                               . $this_bytes;
+                                    $this_length = length($this_bytes);
+                                    $this_expected_len = 7;
+                                    push @expected_errors, $UTF8_GOT_LONG;
+                                }
+                                if ($malformations_name =~ /short/) {
+
+                                    # Just tell the test to not look far
+                                    # enough into the input.
+                                    $this_length--;
+                                    $this_expected_len--;
+                                    push @expected_errors, $UTF8_GOT_SHORT;
+                                }
+                                elsif ($malformations_name
+                                                        =~ /non-continuation/)
+                                {
+                                    # Change the final continuation byte into
+                                    # a non one.
+                                    substr($this_bytes, -1, 1) = '?';
+                                    $this_expected_len--;
+                                    push @expected_errors,
+                                                    $UTF8_GOT_NON_CONTINUATION;
+                                }
+                            }
 
-                        # Will get the overflow message instead of the expected
-                        # message under these circumstances, as they would
-                        # otherwise accept an overflowed value, which the code
-                        # should not allow, so falls back to overflow.
-                        if (is(scalar @warnings, 1,
-                               "$this_name: Got a single warning "))
-                        {
-                            unless (like($warnings[0], qr/overflow/,
-                                        "$this_name: Got overflow warning"))
+                            my $eval_warn = $do_warning
+                                        ? "use warnings '$warning'"
+                                        : $warning eq "utf8"
+                                            ? "no warnings 'utf8'"
+                                            : ( "use warnings 'utf8';"
+                                              . " no warnings '$warning'");
+
+                            # Is effectively disallowed if we've set up a
+                            # malformation, even if the flag indicates it is
+                            # allowed.  Fix up test name to indicate this as
+                            # well
+                            my $disallowed = $disallow_flag
+                                          || $malformations_name;
+                            my $this_name = "utf8n_to_uvchr_error() $testname: 
"
+                                                        . (($disallow_flag)
+                                                            ? 'disallowed'
+                                                            : $disallowed
+                                                            ? $disallowed
+                                                            : 'allowed');
+                            $this_name .= ", $eval_warn";
+                            $this_name .= ", " . (($warn_flag)
+                                                ? 'with warning flag'
+                                                : 'no warning flag');
+
+                            undef @warnings;
+                            my $ret_ref;
+                            my $display_bytes = display_bytes($this_bytes);
+                            my $call = "Call was: $eval_warn; \$ret_ref"
+                                     . " = test_utf8n_to_uvchr_error("
+                                     . "'$display_bytes', $this_length,"
+                                     . "$warn_flag"
+                                     . "|$disallow_flag)";
+                            my $eval_text =      "$eval_warn; \$ret_ref"
+                                     . " = test_utf8n_to_uvchr_error("
+                                     . "'$this_bytes',"
+                                     . " $this_length, $warn_flag"
+                                     . "|$disallow_flag)";
+                            eval "$eval_text";
+                            if (! ok ("$@ eq ''",
+                                "$this_name: eval succeeded"))
                             {
-                                diag $call;
+                                diag "\$!='$!'; eval'd=\"$call\"";
+                                next;
                             }
-                        }
-                        else {
-                            diag $call;
-                            if (scalar @warnings) {
-                                diag "The warnings were: "
-                                                        . join(", ", 
@warnings);
+                            if ($disallowed) {
+                                unless (is($ret_ref->[0], 0,
+                                           "$this_name: Returns 0"))
+                                {
+                                    diag $call;
+                                }
                             }
-                        }
-                    }
-                    elsif ($warn_flag
-                           && ($warning eq 'utf8' || $warning eq $category))
-                    {
-                        if (is(scalar @warnings, 1,
-                               "$this_name: Got a single warning "))
-                        {
-                            unless (like($warnings[0], $message,
-                                        "$this_name: Got expected warning"))
+                            else {
+                                unless (is($ret_ref->[0], $expected_uv,
+                                        "$this_name: Returns expected uv: "
+                                        . sprintf("0x%04X", $expected_uv)))
+                                {
+                                    diag $call;
+                                }
+                            }
+                            unless (is($ret_ref->[1], $this_expected_len,
+                                "$this_name: Returns expected length:"
+                              . " $this_expected_len"))
                             {
                                 diag $call;
                             }
-                        }
-                        else {
-                            diag $call;
-                            if (scalar @warnings) {
-                                diag "The warnings were: "
-                                                        . join(", ", 
@warnings);
-                            }
-                        }
-                    }
 
-                    # Check CHECK_ONLY results when the input is disallowed.  
Do
-                    # this when actually disallowed, not just when the
-                    # $disallow_flag is set
-                    if ($disallowed) {
-                        undef @warnings;
-                        $ret_ref = test_utf8n_to_uvchr($bytes, $length,
-                                                
$disallow_flag|$UTF8_CHECK_ONLY);
-                        unless (is($ret_ref->[0], 0, "$this_name, CHECK_ONLY: 
Returns 0")) {
-                            diag $call;
-                        }
-                        unless (is($ret_ref->[1], -1,
-                            "$this_name: CHECK_ONLY: returns -1 for length"))
-                        {
-                            diag $call;
-                        }
-                        if (! is(scalar @warnings, 0,
-                            "$this_name, CHECK_ONLY: no warnings generated"))
-                        {
-                            diag $call;
-                            diag "The warnings were: " . join(", ", @warnings);
-                        }
-                    }
+                            my $errors = $ret_ref->[2];
 
-                    # Now repeat some of the above, but for
-                    # uvchr_to_utf8_flags().  Since this comes from an
-                    # existing code point, it hasn't overflowed.
-                    next if $will_overflow;
-
-                    # The warning and disallow flags passed in are for
-                    # utf8n_to_uvchr().  Convert them for
-                    # uvchr_to_utf8_flags().
-                    my $uvchr_warn_flag = 0;
-                    my $uvchr_disallow_flag = 0;
-                    if ($warn_flag) {
-                        if ($warn_flag == $UTF8_WARN_SURROGATE) {
-                            $uvchr_warn_flag = $UNICODE_WARN_SURROGATE
-                        }
-                        elsif ($warn_flag == $UTF8_WARN_NONCHAR) {
-                            $uvchr_warn_flag = $UNICODE_WARN_NONCHAR
-                        }
-                        elsif ($warn_flag == $UTF8_WARN_SUPER) {
-                            $uvchr_warn_flag = $UNICODE_WARN_SUPER
-                        }
-                        elsif ($warn_flag == $UTF8_WARN_ABOVE_31_BIT) {
-                            $uvchr_warn_flag = $UNICODE_WARN_ABOVE_31_BIT;
-                        }
-                        else {
-                            fail(sprintf "Unexpected warn flag: %x",
-                                 $warn_flag);
-                            next;
-                        }
-                    }
-                    if ($disallow_flag) {
-                        if ($disallow_flag == $UTF8_DISALLOW_SURROGATE) {
-                            $uvchr_disallow_flag = $UNICODE_DISALLOW_SURROGATE
-                        }
-                        elsif ($disallow_flag == $UTF8_DISALLOW_NONCHAR) {
-                            $uvchr_disallow_flag = $UNICODE_DISALLOW_NONCHAR
-                        }
-                        elsif ($disallow_flag == $UTF8_DISALLOW_SUPER) {
-                            $uvchr_disallow_flag = $UNICODE_DISALLOW_SUPER
-                        }
-                        elsif ($disallow_flag == $UTF8_DISALLOW_ABOVE_31_BIT) {
-                            $uvchr_disallow_flag =
-                            $UNICODE_DISALLOW_ABOVE_31_BIT;
-                        }
-                        else {
-                            fail(sprintf "Unexpected disallow flag: %x",
-                                 $disallow_flag);
-                            next;
-                        }
-                    }
+                            for (my $i = @expected_errors - 1; $i >= 0; $i--) {
+                                if (ok($expected_errors[$i] & $errors,
+                                       "Expected and got error bit return"
+                                     . " for $malformations[$i] malformation"))
+                                {
+                                    $errors &= ~$expected_errors[$i];
+                                }
+                                splice @expected_errors, $i, 1;
+                            }
+                            unless (is(scalar @expected_errors, 0,
+                                    "Got all the expected malformation 
errors"))
+                            {
+                                diag Dumper \@expected_errors;
+                            }
 
-                    $disallowed = $uvchr_disallow_flag;
+                            if ($warn_flag || $disallow_flag) {
+                                is($errors, $expected_error_flags,
+                                   "Got the correct error flag");
+                            }
+                            else {
+                                is($errors, 0, "Got no other error flag");
+                            }
 
-                    $this_name = "uvchr_to_utf8_flags() $testname: "
-                                                  . (($uvchr_disallow_flag)
-                                                    ? 'disallowed'
-                                                    : ($disallowed)
-                                                      ? 'ABOVE_31_BIT allowed'
-                                                      : 'allowed');
-                    $this_name .= ", $eval_warn";
-                    $this_name .= ", " . (($uvchr_warn_flag)
-                                          ? 'with warning flag'
-                                          : 'no warning flag');
+                            if (@malformations) {
+                                if (! $do_warning && $warning eq 'utf8') {
+                                    goto no_warnings_expected;
+                                }
+
+                                # Check that each malformation generates a
+                                # warning, removing that warning if found
+                              MALFORMATION:
+                                foreach my $malformation (@malformations) {
+                                    foreach (my $i = 0; $i < @warnings; $i++) {
+                                        if ($warnings[$i] =~ /$malformation/) {
+                                            pass("Expected and got"
+                                               . "'$malformation' warning");
+                                            splice @warnings, $i, 1;
+                                            next MALFORMATION;
+                                        }
+                                    }
+                                    fail("Expected '$malformation' warning"
+                                       . "but didn't get it");
+
+                                }
+                            }
 
-                    undef @warnings;
-                    my $ret;
-                    my $warn_flag = sprintf "0x%x", $uvchr_warn_flag;
-                    my $disallow_flag = sprintf "0x%x", $uvchr_disallow_flag;
-                    $call = sprintf "call was: $eval_warn; \$ret = 
test_uvchr_to_utf8_flags(0x%x, $warn_flag|$disallow_flag)", $allowed_uv;
-                    $eval_text = "$eval_warn; \$ret = 
test_uvchr_to_utf8_flags($allowed_uv, $warn_flag|$disallow_flag)";
-                    eval "$eval_text";
-                    if (! ok ("$@ eq ''", "$this_name: eval succeeded")) {
-                        diag "\$!='$!'; eval'd=\"$eval_text\"";
-                        next;
-                    }
-                    if ($disallowed) {
-                        unless (is($ret, undef, "$this_name: Returns undef")) {
-                            diag $call;
-                        }
-                    }
-                    else {
-                        unless (is($ret, $bytes, "$this_name: Returns expected 
string")) {
-                            diag $call;
-                        }
-                    }
-                    if (! $do_warning
-                        && ($warning eq 'utf8' || $warning eq $category))
-                    {
-                        if (!is(scalar @warnings, 0,
-                                            "$this_name: No warnings 
generated"))
-                        {
-                            diag $call;
-                            diag "The warnings were: " . join(", ", @warnings);
-                        }
-                    }
-                    elsif ($uvchr_warn_flag
-                           && ($warning eq 'utf8' || $warning eq $category))
-                    {
-                        if (is(scalar @warnings, 1,
-                               "$this_name: Got a single warning "))
-                        {
-                            unless (like($warnings[0], $message,
+                            # Any overflow will override any super or above-31
+                            # warnings.
+                            goto no_warnings_expected if $will_overflow;
+
+                            if (    ! $do_warning
+                                && (   $warning eq 'utf8'
+                                    || $warning eq $category))
+                            {
+                                goto no_warnings_expected;
+                            }
+                            elsif ($warn_flag) {
+                                if (is(scalar @warnings, 1,
+                                    "$this_name: Got a single warning "))
+                                {
+                                    unless (like($warnings[0], $message,
                                             "$this_name: Got expected 
warning"))
+                                    {
+                                        diag $call;
+                                    }
+                                }
+                                else {
+                                    diag $call;
+                                    if (scalar @warnings) {
+                                        output_warnings(@warnings);
+                                    }
+                                }
+                            }
+                            else {
+                              no_warnings_expected:
+                                unless (is(scalar @warnings, 0,
+                                        "$this_name: Got no warnings"))
+                                {
+                                    diag $call;
+                                    output_warnings(@warnings);
+                                }
+                            }
+
+                            # Check CHECK_ONLY results when the input is
+                            # disallowed.  Do this when actually disallowed,
+                            # not just when the $disallow_flag is set
+                            if ($disallowed) {
+                                undef @warnings;
+                                $ret_ref = test_utf8n_to_uvchr_error(
+                                               $this_bytes, $this_length,
+                                               
$disallow_flag|$UTF8_CHECK_ONLY);
+                                unless (is($ret_ref->[0], 0,
+                                        "$this_name, CHECK_ONLY: Returns 0"))
+                                {
+                                    diag $call;
+                                }
+                                unless (is($ret_ref->[1], -1,
+                                    "$this_name: CHECK_ONLY: returns -1 for"
+                                  . " length"))
+                                {
+                                    diag $call;
+                                }
+                                if (! is(scalar @warnings, 0,
+                                    "$this_name, CHECK_ONLY: no warnings"
+                                  . " generated"))
+                                {
+                                    diag $call;
+                                    output_warnings(@warnings);
+                                }
+                            }
+
+                            # Now repeat some of the above, but for
+                            # uvchr_to_utf8_flags().  Since this comes from an
+                            # existing code point, it hasn't overflowed, and
+                            # isn't malformed.
+                            next if @malformations;
+
+                            # The warning and disallow flags passed in are for
+                            # utf8n_to_uvchr_error().  Convert them for
+                            # uvchr_to_utf8_flags().
+                            my $uvchr_warn_flag = 0;
+                            my $uvchr_disallow_flag = 0;
+                            if ($warn_flag) {
+                                if ($warn_flag == $UTF8_WARN_SURROGATE) {
+                                    $uvchr_warn_flag = $UNICODE_WARN_SURROGATE
+                                }
+                                elsif ($warn_flag == $UTF8_WARN_NONCHAR) {
+                                    $uvchr_warn_flag = $UNICODE_WARN_NONCHAR
+                                }
+                                elsif ($warn_flag == $UTF8_WARN_SUPER) {
+                                    $uvchr_warn_flag = $UNICODE_WARN_SUPER
+                                }
+                                elsif ($warn_flag == $UTF8_WARN_ABOVE_31_BIT) {
+                                    $uvchr_warn_flag
+                                                   = 
$UNICODE_WARN_ABOVE_31_BIT;
+                                }
+                                else {
+                                    fail(sprintf "Unexpected warn flag: %x",
+                                        $warn_flag);
+                                    next;
+                                }
+                            }
+                            if ($disallow_flag) {
+                                if ($disallow_flag == $UTF8_DISALLOW_SURROGATE)
+                                {
+                                    $uvchr_disallow_flag
+                                                = $UNICODE_DISALLOW_SURROGATE;
+                                }
+                                elsif ($disallow_flag == 
$UTF8_DISALLOW_NONCHAR)
+                                {
+                                    $uvchr_disallow_flag
+                                                = $UNICODE_DISALLOW_NONCHAR;
+                                }
+                                elsif ($disallow_flag == $UTF8_DISALLOW_SUPER) 
{
+                                    $uvchr_disallow_flag
+                                                  = $UNICODE_DISALLOW_SUPER;
+                                }
+                                elsif ($disallow_flag
+                                                == $UTF8_DISALLOW_ABOVE_31_BIT)
+                                {
+                                    $uvchr_disallow_flag =
+                                                $UNICODE_DISALLOW_ABOVE_31_BIT;
+                                }
+                                else {
+                                    fail(sprintf "Unexpected disallow flag: 
%x",
+                                        $disallow_flag);
+                                    next;
+                                }
+                            }
+
+                            $disallowed = $uvchr_disallow_flag;
+
+                            $this_name = "uvchr_to_utf8_flags() $testname: "
+                                                    . (($uvchr_disallow_flag)
+                                                        ? 'disallowed'
+                                                        : ($disallowed)
+                                                        ? 'ABOVE_31_BIT 
allowed'
+                                                        : 'allowed');
+                            $this_name .= ", $eval_warn";
+                            $this_name .= ", " . (($uvchr_warn_flag)
+                                                ? 'with warning flag'
+                                                : 'no warning flag');
+
+                            undef @warnings;
+                            my $ret;
+                            my $warn_flag = sprintf "0x%x", $uvchr_warn_flag;
+                            my $disallow_flag = sprintf "0x%x",
+                                                        $uvchr_disallow_flag;
+                            $call = sprintf("call was: $eval_warn; \$ret"
+                                          . " = test_uvchr_to_utf8_flags("
+                                          . " 0x%x, 
$warn_flag|$disallow_flag)",
+                                        $allowed_uv);
+                            $eval_text = "$eval_warn; \$ret ="
+                                       . " test_uvchr_to_utf8_flags("
+                                       . "$allowed_uv, $warn_flag|"
+                                       . "$disallow_flag)";
+                            eval "$eval_text";
+                            if (! ok ("$@ eq ''", "$this_name: eval 
succeeded"))
                             {
-                                diag $call;
+                                diag "\$!='$!'; eval'd=\"$eval_text\"";
+                                next;
                             }
-                        }
-                        else {
-                            diag $call;
-                            if (scalar @warnings) {
-                                diag "The warnings were: "
-                                                        . join(", ", 
@warnings);
+                            if ($disallowed) {
+                                unless (is($ret, undef,
+                                        "$this_name: Returns undef"))
+                                {
+                                    diag $call;
+                                }
+                            }
+                            else {
+                                unless (is($ret, $bytes,
+                                        "$this_name: Returns expected string"))
+                                {
+                                    diag $call;
+                                }
+                            }
+                            if (! $do_warning
+                                && ($warning eq 'utf8' || $warning eq 
$category))
+                            {
+                                if (!is(scalar @warnings, 0,
+                                        "$this_name: No warnings generated"))
+                                {
+                                    diag $call;
+                                    output_warnings(@warnings);
+                                }
+                            }
+                            elsif (       $uvchr_warn_flag
+                                   && (   $warning eq 'utf8'
+                                       || $warning eq $category))
+                            {
+                                if (is(scalar @warnings, 1,
+                                    "$this_name: Got a single warning "))
+                                {
+                                    unless (like($warnings[0], $message,
+                                            "$this_name: Got expected 
warning"))
+                                    {
+                                        diag $call;
+                                    }
+                                }
+                                else {
+                                    diag $call;
+                                    output_warnings(@warnings)
+                                                        if scalar @warnings;
+                                }
                             }
                         }
                     }
diff --git a/lib/utf8.t b/lib/utf8.t
index c9dbb6e..0530faf 100644
--- a/lib/utf8.t
+++ b/lib/utf8.t
@@ -165,7 +165,7 @@ no utf8; # Ironic, no?
     use utf8; %a = ("\xE1\xA0"=>"sterling");
     print 'start'; printf '%x,', ord \$_ foreach keys %a; print "end\n";
 BANG
-             qr/^Malformed UTF-8 character \(\d bytes?, need \d, 
.+\).*start\d+,end$/sm
+             qr/^Malformed UTF-8 character: .*? \(too short; got \d bytes?, 
need \d\).*start\d+,end$/sm
             ],
             );
     foreach (@tests) {
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 2c3986b..7479528 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -204,7 +204,9 @@ XXX Changes (i.e. rewording) of diagnostic messages go here
 
 =item *
 
-XXX Describe change here
+Details as to the exact problem have been added to the diagnostics that
+occur when malformed UTF-8 is encountered when trying to convert to a
+code point.
 
 =back
 
@@ -326,6 +328,21 @@ The C<PADOFFSET> type has changed from being unsigned to 
signed, and
 several pad-related variables such as C<PL_padix> have changed from being
 of type C<I32> to type C<PADOFFSET>.
 
+=item *
+
+The function C<L<perlapi/utf8n_to_uvchr>> has been changed to not
+abandon searching for other malformations when the first one is
+encountered.  A call to it thus can generate multiple diagnostics,
+instead of just one.
+
+=item *
+
+A new function, C<L<perlapi/utf8n_to_uvchr_error>>, has been added for
+use by modules that need to know the details of UTF-8 malformations
+beyond pass/fail.  Previously, the only ways to know why a sequence was
+ill-formed was to capture and parse the generated diagnostics, or to do
+your own analysis.
+
 =back
 
 =head1 Selected Bug Fixes
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 8348663..6b42a00 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -3344,10 +3344,13 @@ Perhaps the function's author was trying to write a 
subroutine signature
 but didn't enable that feature first (C<use feature 'signatures'>),
 so the signature was instead interpreted as a bad prototype.
 
-=item Malformed UTF-8 character (%s)
+=item Malformed UTF-8 character%s
 
-(S utf8)(F) Perl detected a string that didn't comply with UTF-8
-encoding rules, even though it had the UTF8 flag on.
+(S utf8)(F) Perl detected a string that should be UTF-8, but didn't
+comply with UTF-8 encoding rules, or represents a code point whose
+ordinal integer value doesn't fit into the word size of the current
+platform (overflows).  Details as to the exact malformation are given in
+the variable, C<%s>, part of the message.
 
 One possible cause is that you set the UTF8 flag yourself for data that
 you thought to be in UTF-8 but it wasn't (it was for example legacy
diff --git a/proto.h b/proto.h
index 9f504ea..701dc9e 100644
--- a/proto.h
+++ b/proto.h
@@ -3534,9 +3534,12 @@ PERL_CALLCONV UV Perl_utf8_to_uvuni_buf(pTHX_ const U8 
*s, const U8 *send, STRLE
 #define PERL_ARGS_ASSERT_UTF8_TO_UVUNI_BUF     \
        assert(s); assert(send)
 
-PERL_CALLCONV UV       Perl_utf8n_to_uvchr(pTHX_ const U8 *s, STRLEN curlen, 
STRLEN *retlen, U32 flags);
+PERL_CALLCONV UV       Perl_utf8n_to_uvchr(pTHX_ const U8 *s, STRLEN curlen, 
STRLEN *retlen, const U32 flags);
 #define PERL_ARGS_ASSERT_UTF8N_TO_UVCHR        \
        assert(s)
+PERL_CALLCONV UV       Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s, STRLEN 
curlen, STRLEN *retlen, const U32 flags, U32 * errors);
+#define PERL_ARGS_ASSERT_UTF8N_TO_UVCHR_ERROR  \
+       assert(s)
 PERL_CALLCONV UV       Perl_utf8n_to_uvuni(pTHX_ const U8 *s, STRLEN curlen, 
STRLEN *retlen, U32 flags);
 #define PERL_ARGS_ASSERT_UTF8N_TO_UVUNI        \
        assert(s)
@@ -5572,6 +5575,9 @@ STATIC bool       S_isa_lookup(pTHX_ HV *stash, const 
char * const name, STRLEN len, U
        assert(stash); assert(name)
 #endif
 #if defined(PERL_IN_UTF8_C)
+STATIC char *  S__byte_dump_string(pTHX_ const U8 * s, const STRLEN len);
**** PATCH TRUNCATED AT 2000 LINES -- 1642 NOT SHOWN ****

--
Perl5 Master Repository

Reply via email to