[perl.git] branch blead, updated. v5.27.1-221-gb7d798d66e

Karl Williamson Wed, 12 Jul 2017 20:24:55 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/b7d798d66e9bbbf5a60d21864d3d5b10cb4adb29?hp=1e629c2290cf7b0cfc69d56f21dbb4c53f60127d>


- Log -----------------------------------------------------------------
commit b7d798d66e9bbbf5a60d21864d3d5b10cb4adb29
Merge: 1e629c2290 d22ec71778
Author: Karl Williamson <[email protected]>
Date:   Wed Jul 12 21:15:09 2017 -0600

    Merge branch 'utf8 fixes' into blead
    
    This branch reimplements the forbidding of code points above IV_MAX in
    such a way that encountering UTF-8 evaluating to such doesn't kill the
    receiving process, but is treated as an ordinary overflow.  To do
    otherwise can lead to Denial of Service attacks.
    
    It fixes several bugs that occur only on UTF-8 that is malformed or for
    very large code points.
    
    And it cleans up and revamps the testing of the XS API for UTF-8 so that
    more coverage is done, but in a fraction of the previous time needed.

commit d22ec71778db5c28dfb4e339337f90dad5d214f9
Author: Karl Williamson <[email protected]>
Date:   Sat Jul 1 11:58:00 2017 -0600

    Forbid above IV_MAX code points
    
    This implements the restriction of code points to 0..IV_MAX in such a
    way that the process doesn't die when presented with input UTF-8 that
    evaluates to a larger one.  Instead, it is treated as overflow.
    
    The commit reinstates causing the offending process to die if trying to
    create a character somehow that is above IV_MAX (like
    chr(0xFFFFFFFFFFFFF) or trying to do certain operations on one if
    somehow one did get created.
    
    The long term goal is to use code points above IV_MAX internally, as
    Perl6 does.  So code and tests are not removed, just commented out

M       ext/XS-APItest/t/utf8_warn_base.pl
M       t/lib/warnings/utf8
M       t/op/index.t
M       t/op/utf8decode.t
M       t/re/pat_advanced.t
M       utf8.c

commit e050c0076b9d0972c025d71afe0180d9dfbc6b15
Author: Karl Williamson <[email protected]>
Date:   Wed Jul 12 20:28:45 2017 -0600

    utf8.c: Change 2 static fcns to handle overlongs
    
    This will be used in the following commit.
    
    One function is made more complicated, so we stop asking it to be
    inlined.

M       embed.fnc
M       proto.h
M       utf8.c

commit 5f995336c78d31708a69477c3351b87e285d64b8
Author: Karl Williamson <[email protected]>
Date:   Wed Jul 12 20:26:18 2017 -0600

    utf8.c: Move and slightly change comment block
    
    This is so there are fewer real differences shown in the next commit

M       utf8.c

commit c285bbc4a6321e4e787d0fac9f34c354c7647256
Author: Karl Williamson <[email protected]>
Date:   Sat Jul 1 07:21:09 2017 -0600

    utf8.c: Generalize static fcn return for indeterminate result
    
    This makes it harder to think that 0 means a definite FALSE.

M       embed.fnc
M       proto.h
M       utf8.c

commit a77c906e26e63e32dbf58d6de81399b8e3534fd1
Author: Karl Williamson <[email protected]>
Date:   Sat Jul 1 06:32:28 2017 -0600

    utf8.c: Move a fcn within the file
    
    This simply moves a function to later in the file.  The next commIt will
    change it to needing a definition which, until this commit, came after it
    in the file, and so was not available to it.

M       utf8.c

commit d6be65aef0919d5dceda4442de95f5de90b57e41
Author: Karl Williamson <[email protected]>
Date:   Sat Jul 1 06:43:34 2017 -0600

    utf8.c: Generalize static fcn return for indeterminate result
    
    This makes it harder to think that 0 means a definite FALSE.

M       embed.fnc
M       proto.h
M       utf8.c

commit 8d6204cca6dfeab549e87d3b40fa1900a8ca8a0e
Author: Karl Williamson <[email protected]>
Date:   Sat Jul 1 06:18:01 2017 -0600

    utf8.c: Generalize static fcn return for indeterminate result
    
    Prior to this commit, isFF_OVERLONG() returned a boolean, with 0 also
    indicating that there wasn't enough information to make a determination.
    I realized that I was forgetting that 0 wasn't necessarily definitive
    while coding.  By changing the API to return 3 values, forgetting that
    won't likely happen.
    
    This and the next several commits change several other functions that
    have the same predicament.

M       embed.fnc
M       proto.h
M       utf8.c

commit c4e96019708f80aedf076564f0d2994581c027b9
Author: Karl Williamson <[email protected]>
Date:   Fri Jun 30 13:21:58 2017 -0600

    utf8.h: Comments only
    
    An earlier commit had split some comments up.  And this adds clarifying
    details.

M       utf8.h

commit 0a8a1a5b0c576b95f3c4a48a6912f86bcf34e281
Author: Karl Williamson <[email protected]>
Date:   Fri Jun 30 13:19:10 2017 -0600

    utf8.c: Reorder two 'if' clauses
    
    This is purely to get vertical line up that easier to see of slightly
    differently spelled tests

M       utf8.c

commit 10dabba92c9a6ec2175b19c741c75d807eb02601
Author: Karl Williamson <[email protected]>
Date:   Fri Jun 30 11:19:59 2017 -0600

    utf8.c: Slightly simplify some code
    
    This just does a small refactor, which I think makes things easier to
    understand.

M       utf8.c

commit 08e73697e67f540b8a9dfcb30133d7176534e3c5
Author: Karl Williamson <[email protected]>
Date:   Sat Jul 8 14:54:28 2017 -0600

    utf8n_to_uvchr(): Properly handle extremely high code points
    
    It turns out that it could incorrectly deem something to be overflowing
    or overlong.  This fixes that and changes the test to catch this
    possibility.  This fixes a bug, so now on 32-bit systems, it detects
    that if you have a start byte of FE, you need  a continuation byte to
    determine if the result overflows.

M       ext/XS-APItest/t/utf8_warn_base.pl
M       t/op/utf8decode.t
M       utf8.c

commit 6af459e6161c58fe108a33cf114431c9f521758a
Author: Karl Williamson <[email protected]>
Date:   Fri Jul 7 12:39:33 2017 -0600

    rm APItest/t/utf8_malformed.t
    
    This file no longer contains any tests.  All were either made redundant
    with utf8_warn_base.pl or have been moved to it.

M       MANIFEST
D       ext/XS-APItest/t/utf8_malformed.t

commit 1d21b5e70d6f1ea213b7e73d35b3143fee2f05e7
Author: Karl Williamson <[email protected]>
Date:   Fri Jul 7 12:37:39 2017 -0600

    Move test to utf8_warn_base.pl
    
    This is the final test that was in utf8_malformed.t.  The next commit
    will remove the file.

M       ext/XS-APItest/t/utf8_malformed.t
M       ext/XS-APItest/t/utf8_warn_base.pl

commit a8cda038929382795c1d3d13a8e65bb559baf87a
Author: Karl Williamson <[email protected]>
Date:   Wed Jul 5 10:27:25 2017 -0600

    APItest/t/utf8_malformed.t: Remove 2 redundant tests
    
    These tests for the malformation where a UTF-8 sequence is interrupted
    by the beginning of another character, already get tested int
    utf8_warn_base.pl

M       ext/XS-APItest/t/utf8_malformed.t

commit 2282dfc402901bcee8aec629d87424ca54614bee
Author: Karl Williamson <[email protected]>
Date:   Fri Jul 7 15:20:44 2017 -0600

    APItest/t/utf8_warn_base.pl: White-space only
    
    This indents properly after the previous commit created a block around
    this code, and reflows to fit in 79 columns.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit cf8a8202801e2546c26b9aedd1337bcad8b2decb
Author: Karl Williamson <[email protected]>
Date:   Tue Jul 4 12:57:40 2017 -0600

    APItest/t/utf8_warn_base.pl: Add a test
    
    This verifies that we don't mistake an overlong for overflow

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 3f0559171310ec103fb86919d945272c1d49b308
Author: Karl Williamson <[email protected]>
Date:   Tue Jul 4 16:04:26 2017 -0600

    APItest/t/utf8_malformed.t: move tests to utf8_warn_base.pl
    
    This adds infrastructure to utf8_warn_base.pl to handle the overlong
    tests that are now moved to it from utf8_malformed.t

M       ext/XS-APItest/t/utf8_malformed.t
M       ext/XS-APItest/t/utf8_warn_base.pl

commit 45d8ef8c7a1678ef29d3af5c294252de6f976fd3
Author: Karl Williamson <[email protected]>
Date:   Tue Jul 4 12:22:29 2017 -0600

    APItest/t/utf8_malformed.t: move test to utf8_warn_base.pl
    
    Actually, this test was already in utf8_warn_base, but was executed only
    on 64 bit platforms.  It is reasonable to make sure it works on 32 bit
    ones, as it is an edge case there as well, in the sense that it is the
    first 13 byte code point.
    
    This is the first of a series of commits to remove all the tests in
    utf8_malformed, so the entire file can be removed.
    
    utf8_warn_base has been heavily cleaned up, and now has better
    infrastructure for more completely testing than utf8_malformed.  The
    two files have much the same logic, and rather than trying to maintain
    two versions, it's better to combine them.

M       ext/XS-APItest/t/utf8_malformed.t
M       ext/XS-APItest/t/utf8_warn_base.pl

commit 969872955a4494d1f8c5bdb29a0e2b7e3f9633ec
Author: Karl Williamson <[email protected]>
Date:   Tue Jul 4 13:23:18 2017 -0600

    APItest/t/utf8_malformed.t: Remove redundant test
    
    This tests the too short malformation, which is already adequately
    tested in utf8_warn_base.pl

M       ext/XS-APItest/t/utf8_malformed.t

commit f636aa19d5be9f55aee67cb862e158e9d515278e
Author: Karl Williamson <[email protected]>
Date:   Tue Jul 4 13:19:33 2017 -0600

    APItest/t/utf8_malformed.t: Remove 2 redundant tests
    
    These test overflowing, which is already adequately tested in
    utf8_warn_base.pl

M       ext/XS-APItest/t/utf8_malformed.t

commit 6b328a62987856665a850935ca56a35bb72bd001
Author: Karl Williamson <[email protected]>
Date:   Tue Jul 4 10:06:37 2017 -0600

    APItest/t/utf8_malformed.t: Remove redundant test
    
    This test already is covered in utf8_warn_base.pl.  It tests an overlong
    for 2**32.

M       ext/XS-APItest/t/utf8_malformed.t

commit 8e0327af1de3a51981f817d106c92114bccab5e3
Author: Karl Williamson <[email protected]>
Date:   Fri Jul 7 10:56:23 2017 -0600

    APItest/t/utf8_warn_base.pl: Add tests
    
    This test takes its various base tests, and intentionally perturbs them to
    create malformations to additionally test.  Prior to this commit, only
    the function utf8n_to_uvchr_error() was being tested with these
    perturbations.  Now, the functions whoe names start with 'is' also get
    tested.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 8e0e76af48558f95aad7b438f56e9934307615ba
Author: Karl Williamson <[email protected]>
Date:   Wed Jul 5 14:58:43 2017 -0600

    APItest/t/utf8_warn_base.pl: Move some tests
    
    This just moves a block and indents and reflows it.  It is moved to
    within the loops that set up various malformations in the input.  The
    next commit will change these tests to actually use the perturbed
    inputs.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 153bcbd69bf72d0fe3f203f3dd6c51551fdc20f1
Author: Karl Williamson <[email protected]>
Date:   Wed Jul 5 13:09:27 2017 -0600

    APItest/t/utf8_warn_base.pl: Move some setup code
    
    We don't need this code until we've determined we're actually going to
    go through with a test.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit a93dd121cb713b954d2e7ba422a8073211d4230a
Author: Karl Williamson <[email protected]>
Date:   Fri Jul 7 10:34:01 2017 -0600

    APItest/t/utf8_warn_base.pl: Clean up test name
    
    This name was confusing, as there are two types of things that can be
    (dis)allowed, and in the case of an overflow, the first type is not
    being tested but has the adjective (dis)allowed present.  Add the term
    only when appropriate.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit b39a36941224113cceab40de28a787d572c9d707
Author: Karl Williamson <[email protected]>
Date:   Wed Jul 5 13:00:03 2017 -0600

    APItest/t/utf8_warn_base.pl: Skip inappropriate tests
    
    If we don't have enough information for the test to be meaningful, don't
    bother doing it.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 404a1403f230a499ebbd08ad7ed02da2e30951d9
Author: Karl Williamson <[email protected]>
Date:   Fri Jun 30 22:29:36 2017 -0600

    APItest/t/utf8_warn_base.pl: Use a default value
    
    This adds a default number of bytes needed to detect overflows, like
    previous commits have added defaults for other categories.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 57ff5f598ddf7ce8834832a15ba1a4628b5932c4
Author: Karl Williamson <[email protected]>
Date:   Tue Jun 27 14:46:26 2017 -0600

    utf8n_to_uvchr() Properly test for extended UTF-8
    
    It somehow dawned on me that the code is incorrect for
    warning/disallowing very high code points.  What is really wanted in the
    API is to catch UTF-8 that is not necessarily portable.  There are
    several classes of this, but I'm referring here to just the code points
    that are above the Unicode-defined maximum of 0x10FFFF.  These can be
    considered non-portable, and there is a mechanism in the API to
    warn/disallow these.
    
    However an earlier standard defined UTF-8 to handle code points up to
    2**31-1.  Anything above that is using an extension to UTF-8 that has
    never been officially recognized.  Perl does use such an extension, and
    the API is supposed to have a different mechanism to warn/disallow on
    this.
    
    Thus there are two classes of warning/disallowing for above-Unicode code
    points.  One for things that have some non-Unicode official recognition,
    and the other for things that have never had official recognition.
    
    UTF-EBCDIC differs somewhat in this, and since Perl 5.24, we have had a
    Perl extension that allows it to handle any code point that fits in a
    64-bit word.  This kicks in at code points above 2**30-1, a number
    different than UTF-8 extended kicks in on ASCII platforms.
    
    Things are also complicated by the fact that the API has provisions for
    accepting the overlong UTF-8 malformation.  It is possible to use
    extended UTF-8 to represent code points smaller than 31-bit ones.
    
    Until this commit, the extended warning/disallowing was based on the
    resultant code point, and only when that code point did not fit into 31
    bits.
    
    But what is really wanted is if extended UTF-8 was used to represent a
    code point, no matter how large the resultant code point is.  This
    differs from the previous definition, but only for EBCDIC platforms, or
    when the overlong malformation was also present.  So it does not affect
    very many real-world cases.
    
    This commit fixes that.  It turns out that it is easier to tell if
    something is using extended-UTF8.  One just looks at the first byte of a
    sequence.
    
    The trailing part of the warning message that gets raised is slightly
    changed to be clearer.  It's not significant enough to affect perldiag.

M       embed.fnc
M       embed.h
M       ext/XS-APItest/t/utf8_warn_base.pl
M       proto.h
M       utf8.c
M       utf8.h
M       utfebcdic.h

commit d044b7a780a1f1916e96ed7d255bb0b7dad54713
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 26 11:43:21 2017 -0600

    utf8.h: Add synonyms for flag names
    
    The next commit will fix the detection of using Perl's extended UTF-8 to
    be more accurate.  The current name for various flags in the API is
    somewhat misleading.  What is really wanted to know is if extended UTF-8
    was used, not the value of the resultant code point.
    
    This commit basically does
    
        s/ABOVE_31_BIT/PERL_EXTENDED/g
    
    It also similarly changes the name of a hash key in APItest/t/utf8.t.
    
    This intermediary step makes the next commit easier to read.

M       ext/XS-APItest/t/utf8.t
M       ext/XS-APItest/t/utf8_setup.pl
M       ext/XS-APItest/t/utf8_warn_base.pl
M       inline.h
M       utf8.c
M       utf8.h

commit 9e35eec98dba17eb946a8462b68223b962fdcf46
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 26 22:22:32 2017 -0600

    APItest/t/utf8_warn_base.pl: Generate smaller overlongs
    
    This file generates overlongs for testing that that malformation is
    handled properly.  This commit changes it to avoid generating an
    overlong that uses Perl's extended UTF-8.  This will come in handy a
    couple of commits from now, when a bug dealing with that gets fixed.
    
    It also moves setting a variable to outside the loop

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 54c8f8c52d2e0b0b9a87e5f85907fba24c2481f9
Author: Karl Williamson <[email protected]>
Date:   Fri Jun 30 12:57:49 2017 -0600

    APItest/t/utf8_warn_base.pl: Data::Dumper isn't needed

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 4c210ae9fdfacb3cd9e99e11f0a1b017805effa6
Author: Karl Williamson <[email protected]>
Date:   Fri Jun 30 13:14:57 2017 -0600

    APItest/t/utf8_warn_base.pl: Move some tests from loop
    
    These test if any warnings are generated.  None are ever likely to be
    given the way things work.  We can test after the loop that none of the
    iterations generated warnings, as any would accumulate.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 6d736463b2cba639a7af7b09134aa79854c4dcd8
Author: Karl Williamson <[email protected]>
Date:   Sun Jun 25 21:35:05 2017 -0600

    APItest/t/utf8_warn_base.pl: Extract code into a fcn
    
    This uses a function to test for a common paradigm.  The next couple of
    commits will change that paradigm, and now the code will only have to
    change in one place.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 6c64cd9dad2f8d5da8eb9122ab251b85430718e2
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 19 12:58:19 2017 -0600

    utf8.c: Fix bugs with overlongs combined with other malformations.
    
    The code handling the UTF-8 overlong malformation must come after
    handling all the other malformations.  This is because it may change the
    code point represented to the REPLACEMENT CHARACTER.  The other
    malformation code is expecting the code point to be the original one.
    This may cause failure to catch and report other malformations, or
    report the wrong value of the erroneous code point.
    
    What was needed was simply to move the 'if else' branch for overlongs to
    after the branches for the other formations.

M       ext/XS-APItest/t/utf8_warn_base.pl
M       utf8.c

commit 717dd9f90cd9f85157fcc3cb27284614dc319825
Author: Karl Williamson <[email protected]>
Date:   Sat Jun 24 22:55:10 2017 -0600

    APItest/t/utf8_warn_base.pl: Add some tests
    
    This adds testing for having some malformations allowed.  These had not
    been checked for, and there were some bugs.  It's easiest to TODO all
    ones that might fail, creating many passing TODOs.  The TODO will be
    removed in the next commit.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 4816e15f734af639c297c5a7aa5b46861bb59871
Author: Karl Williamson <[email protected]>
Date:   Sat Jun 24 22:42:25 2017 -0600

    APItest/t/utf8_warn_base.pl: Move things out of inner loop
    
    The most expensive stuff in this set of nested loops can actually be
    done several nests up (even higher for some things, but it's not worth
    the trouble).  Given that this test file has been too-long runnning, I
    moved things to an outer loop context.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 8139d29189cdd6a84a94add791bc278835a50dbc
Author: Karl Williamson <[email protected]>
Date:   Sat Jun 24 21:32:41 2017 -0600

    APItest/t/utf8_warn_base.pl: Reorder loop nesting
    
    This is in preparation for the next commit.  It also changes some of the
    loop variables to 1 to indicate truth, rather than a string.  This will
    make some things easier later.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit c0e5ee4b2e22028e813932f7003d80e2e14c3c91
Author: Karl Williamson <[email protected]>
Date:   Wed Jun 21 13:38:55 2017 -0600

    APItest/t/utf8_warn_base.pl: Revamp testing isFOO
    
    Several commits ago, the loop that handles testing the functions that
    convert from/to UTF-8 was revampled.  This commit does a similar thing
    for the portion of the code that handles the isFOO functions, and
    partial character recognition.
    
    It reorders the nesting of loops so that more tests can be done than
    previously in the outer loop.  Among these, it now doesn't skip overflow
    and deals with using Perl's extended UTF-8 better.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 56d0c37799bcd0222199720c9af1333588aefa65
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 19 12:56:38 2017 -0600

    utf8n_to_uvchr: U+ should be for only Unicode code points
    
    For above-Unicode, we should use 0xDEADBEEF instead of U+DEADBEEF.
                                     ^^                    ^^
    This is because U+ only applies to Unicode.  This only affects a warning
    message for overlongs.

M       ext/XS-APItest/t/utf8_warn_base.pl
M       utf8.c

commit ba627a0b7288a0ee70996a0740fcbee1c34ef451
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 19 11:52:34 2017 -0600

    APItest/t/utf8_warn_base.pl: Add some tests
    
    This adds the edges between overflowing and not on 64-bit platforms

M       ext/XS-APItest/t/utf8_warn_base.pl

commit f4da64d47abe4315017f8bb4c0565ea96363041f
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 19 11:47:54 2017 -0600

    APItest/t/utf8_warn_base.pl: Do test on all platforms
    
    This modifies and moves a test so it gets done on all platforms, not
    just 32-bit ASCII.  It is an edge case on all platforms, but gives
    differing results, overflowing on 32-bit ones.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit f6104ed73f9c2a5567b0b6e401f67c1b489ab0c3
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 19 11:01:54 2017 -0600

    APItest/t/utf8_warn_base.pl: Rename and modify test
    
    This test is testing the first code point that requires 13 UTF-8 bytes
    to represent on ASCII platforms.  Change the name from its previous
    vague one to one that indicates this.  And don't test for it on EBCDIC
    platforms, as it isn't an edge case there.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 15dea659ecefb1bd46b95324cbccfe4d65887a53
Author: Karl Williamson <[email protected]>
Date:   Sun Jun 18 22:55:38 2017 -0600

    APItest/t/utf8_warn_base.pl: Remove obsolete test
    
    This was an attempt to test the fact that very high code points are
    controlled both by regular above-Unicode warnings, and special,
    non-portable warnings.  This test is now done better in the loop in the
    file.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 48ceb60a20ac85a763f25b1a6d5bb816d86ef953
Author: Karl Williamson <[email protected]>
Date:   Sun Jun 18 22:52:06 2017 -0600

    APItest/t/utf8_warn_base.pl: Rename a test
    
    The names are now more uniform.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 7c9f4ec632af9febc957e5c2fc4170b4df830607
Author: Karl Williamson <[email protected]>
Date:   Sun Jun 18 22:50:12 2017 -0600

    APItest/t/utf8_warn_base.pl: Move some tests in the file
    
    The order had been to mostly test in increasing code point order.  This
    sorts the two exceptions to comply.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit c81d4d83678f0325c3e898dc4fd8e5a10a28da62
Author: Karl Williamson <[email protected]>
Date:   Sun Jun 18 22:36:21 2017 -0600

    APItest/t/utf8_warn_base.pl: Split test into 64 vs 32 bit versions
    
    It's cleaner to have this test which differs on 32 vs 64 bit platforms
    in the appropriate sections that have other tests specific to their
    platforms.
    
    The tests for EBCDIC were arbitrary, just placeholders really, since
    these particular tests were added for situations found only on ASCII
    platforms.  Therefore, the EBCDIC tests were removed.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit c7f9e4bbbd50ef38446171dfb3a7f236a4e7380a
Author: Karl Williamson <[email protected]>
Date:   Sun Jun 18 22:25:39 2017 -0600

    APItest/t/utf8_warn_base.pl: Create block for warnings control
    
    This adds a block that turns off warnings in the whole thing, so that
    tests can be more easily be modified in future commits, and the interior
    warnings control statments can be removed.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 8f79178bd42d2f82f7b0a3f1a58dcb81f3316e97
Author: Karl Williamson <[email protected]>
Date:   Sat Jun 17 22:31:58 2017 -0600

    APItest/t/utf8_warn_base.pl: White-space, comments only
    
    This reflows things after the changes in the previous commits

M       ext/XS-APItest/t/utf8_warn_base.pl

commit d42d7565c97e85ea4b3d3478035263f6ecb73ba4
Author: Karl Williamson <[email protected]>
Date:   Sat Jun 17 18:58:54 2017 -0600

    APItest/t/utf8_warn_base.pl: Remove hash element
    
    The previous commit has enabled this one to remove another of the hash
    elements from the tests data structure.  The value can now be calculated
    from the code point.  The element gave the warnings category to used.
    But now we use the category based on the code point, with special
    handling of the ones that can be true for regular above-Unicode, and
    those that are so far above Unicode that they must use Perl's extended
    UTF-8.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit e4e140b4f9385b86b0945068ab7879b0d5d44672
Author: Karl Williamson <[email protected]>
Date:   Sat Jun 17 06:43:03 2017 -0600

    APItest/t/utf8_warn_base.pl: Remove most tests
    
    In order to test that the various flags passed to utf8n_to_uvchr()
    work independently of each other, previously this file tried all
    possible combinations.  But, as explained in the comments added in this
    commit, by appropriate use of all the flags that don't apply to
    something being tested, we can verify that those flags are independent
    of that thing, and cut down the combinatorial complexity significantly.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit db0f09e6d289b9eb97f8d5d0ac4172233a6a2829
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 12:06:57 2017 -0600

    utf8n_to_uvchr() Use correct warnings category
    
    The warning about too large a code point should be under the
    'non_unicode' warnings category.

M       ext/XS-APItest/t/utf8_warn_base.pl
M       t/lib/warnings/utf8
M       utf8.c

commit 69485e19957cdc18073e734041cf0c35515c0995
Author: Karl Williamson <[email protected]>
Date:   Sun Jul 2 09:11:17 2017 -0600

    APItest/t/utf8_warn_base.pl: Revamp loop to/from utf8
    
    This test file had gotten kinda messy as new tasks were shoe horned into
    it.  This cleans it up, and positions it to be easier maintain going
    forward.  I tried to minimize the number of changes shown per commit,
    but this is the minimal I could get, and since it is a revamp, there are
    lots of differences.
    
    Some combinatorial explosion has been removed.
    
    A new subroutine is created which compares the expected vs actually
    gotten warnings, and is called in two places, removing duplicated code.
    
    This exposed a bug in very large, hence rare, code points.  It will be
    fixed in the next commit.  It was far easier to just make all similar
    tests TODO here, removing that in the next commit.  This means this
    commit has many passing TODOs

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 601e92f1ff871a52b4fbb83b5061574a3541c8f3
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 18:53:43 2017 -0600

    APItest/t/utf8_warn_base.pl: Tighten up tests
    
    This commit causes the tests to check that messages containing a code
    point have the correct exact wording, including the code point.  The
    tests are tightened up somewhat for other messages, but more is coming
    in a later commit.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit bf422d6af5a52f0744708794cad460029b765902
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 18:27:54 2017 -0600

    APItest/t/utf8_warn_base.pl: Skip most tests
    
    This test file tests every end-of-Unicode-plane noncharacter, and a
    middling surrogate, and a nonchar in the interior of the consecutive
    range of them.  But, we don't really have to do more than basic testing
    for these  middling cases.  We should test that they are detected as
    being in their respective categories, but testing that all combinations
    of warning and disallowed flags and return flags shouldn't be necessary.
    It's sufficient to test for those for the real edge cases.
    
    This cuts the number of tests in this file to somewhat less than 1/3 of
    the original.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 8d6f1506d59ced07c4dbd2a8cf8983050a3df51d
Author: Karl Williamson <[email protected]>
Date:   Sat Jun 17 06:27:59 2017 -0600

    APItest/t/utf8_warn_base.pl: Store warnings sans \n
    
    This will make the output more legible that future commits will create

M       ext/XS-APItest/t/utf8_warn_base.pl

commit d402d77fdafc7c56e640227277fc66556e971287
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 17:35:54 2017 -0600

    APItest/t/utf8_warn_base.pl: Change some test names
    
    This omits distracting detail from subsidiary tests, indenting them from
    the major one.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit af816908b3bd584a0cd5be501757e39360add754
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 16:13:12 2017 -0600

    APItest/t/utf8_warn_base.pl: Simplify some calculations
    
    This commit pulls some variable setting outside an inner loop.  It's
    easily settable there, instead of being calculated.  It allows for
    removal of another hash element.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 2c511c58ef33086dbe1a3c74f8148d98a4168437
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 15:45:14 2017 -0600

    APItest/t/utf8_warn_base.pl: Do formatting outside loop
    
    To save extra effort

M       ext/XS-APItest/t/utf8_warn_base.pl

commit a8ee5133c5c9c441dbcc47634e6d89e032237d9d
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 15:00:08 2017 -0600

    APItest/t/utf8_warn_base.pl: Improve some more diagnostics
    
    This changes the diagnostics when testing utf8n_to_uvchr() so they are
    more human readable, and aren't generated until failure.
    
    It also corrects things to display $@ on eval failure (previously it
    displayed $!)

M       ext/XS-APItest/t/utf8_warn_base.pl

commit d884ea320c9cab99f6a561e7fc84844a14105f07
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 14:24:05 2017 -0600

    APItest/t/utf8_warn_base.pl: Improve some diagnostics
    
    This creates a function that will display in more human-readable form
    the eval string used for testing uvchr_to_utf8().  And it calls that
    function should there be a failure.  Thus the calculations aren't done
    unless necessary.
    
    It also corrects a diagnostic to show $@ after an eval failure instead
    of $!

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 5722c46dbd929c6e760b3760c459fa25971324c1
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 12:49:10 2017 -0600

    APItest/t/utf8_warn_base.pl: Display mnemonics on error
    
    Part of the testing for this is that the returned flags for problematic
    conditions are correct.  This commit adds a routine that will convert
    numeric values of the flags into a mnemonic string like FOO|BAR|BAZ.
    This makes debugging easier.  The names are not computed unless there is
    an error.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 9cdc3054e97b95b79f4b6e0566bf72a937ee8399
Author: Karl Williamson <[email protected]>
Date:   Tue Jun 13 22:48:36 2017 -0600

    APItest/t/utf8_warn_base.pl: Rename some variables
    
    The new names more closely indicate the variables' purposes.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 67e454242dfcd7763ad6e64e1f14bb0fbe06add7
Author: Karl Williamson <[email protected]>
Date:   Fri Jun 30 11:55:18 2017 -0600

    APItest/t/utf8_warn_base.pl: Make hash element optional
    
    This element of the hash gives how many bytes are needed in an
    incomplete sequence in order to classify the full sequence.  In some
    cases every code point in the category has this be the same number, and
    it can be cleaner to not manually specify the number.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit b7e1f4b29c383c46e80e729af67b495b52859702
Author: Karl Williamson <[email protected]>
Date:   Thu May 25 21:16:29 2017 -0600

    APItest/t/utf8_warn_base.pl: Remove hash elements
    
    These two elements can be calculated from the others

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 3022ad002b5188e029536efe1a44c1799bb1564b
Author: Karl Williamson <[email protected]>
Date:   Thu May 25 21:04:09 2017 -0600

    APItest/t/utf8_warn_base.pl: Remove element from hash
    
    The warning message can be figured out from other elements.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit b3169593bf2efcd9a2b6848304e79f5a6ac362fe
Author: Karl Williamson <[email protected]>
Date:   Thu May 25 20:09:07 2017 -0600

    APItest/t/utf8_warn_base.pl: Eliminate hash element
    
    This is leftover from an earlier version of the tests, and can be
    calculated instead of having to manually specify it.

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 04f42bf61483ac70f2927c2fe03f542893091fcb
Author: Karl Williamson <[email protected]>
Date:   Wed Jun 14 15:24:29 2017 -0600

    APItest/t/utf8_warn_base.pl: Standardize overflow test detection
    
    There are two methods currently for detecting if a test is for overflow.
    This standardizes on the one where the expected code point is 0, and
    uses the already existing variable instead of qr//

M       ext/XS-APItest/t/utf8_warn_base.pl

commit 6b3df8c630fe72d3bdc58dd86ca5406d92ffcde5
Author: Karl Williamson <[email protected]>
Date:   Mon May 15 09:54:40 2017 -0600

    APItest/t/utf8.t: Don't test above IV_MAX
    
    For 32-bit platforms, this means moving  the tests to the 64-bit only 
portion of
    the file.  And it comments out the tests that are above 64-bit IV_MAX.
    
    This is in preparation for IV_MAX being the upper legal limit for code
    points.

M       ext/XS-APItest/t/utf8.t

commit c57307f27baa2bec4c17c982b73505a1353f3e3e
Author: Karl Williamson <[email protected]>
Date:   Wed Jul 5 11:31:12 2017 -0600

    APItest/t/utf8.t: Add a test
    
    This test will be important when we convert to limiting code points to
    at most IV_MAX.

M       ext/XS-APItest/t/utf8.t

commit 1b250a9a07fc58ce35953e26bd8910859b95b05c
Author: Karl Williamson <[email protected]>
Date:   Sat May 13 22:58:00 2017 -0600

    APItest/t/utf8.t: Comments, white-space only

M       ext/XS-APItest/t/utf8.t

commit af006f6a50ede09833f5f95c04242d1e8a1ae93a
Author: Karl Williamson <[email protected]>
Date:   Sat May 13 22:53:47 2017 -0600

    APItest/t/utf8.t: Better handle some platforms
    
    A future commit will cause some expected errors to not actually be
    errors on some platforms.  This detects and handles these.

M       ext/XS-APItest/t/utf8.t

commit 28d6a4972d7ee9862d0e1740e326d738560708c9
Author: Karl Williamson <[email protected]>
Date:   Sat May 13 22:51:43 2017 -0600

    APItest/t/utf8.t: Remove unnecessary hash initializations

M       ext/XS-APItest/t/utf8.t

commit 8769418b77682e2b83c96663928b294760a26c89
Author: Karl Williamson <[email protected]>
Date:   Sat May 13 22:50:26 2017 -0600

    APItest/t/utf8.t: Fix some convoluted code
    
    This code got overly complex as time went by, and can be cleaned up.

M       ext/XS-APItest/t/utf8.t

commit 8209b1d03e2da318e3b17805938d90283280d8f5
Author: Karl Williamson <[email protected]>
Date:   Mon May 8 09:47:41 2017 -0600

    APItest/t/utf8.t: Rmv useless line
    
    This entry is overwritten by the next line.

M       ext/XS-APItest/t/utf8.t

commit dbb8d79823035c58d8e52ea1c4bc1e5d16d1c2de
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 26 22:27:23 2017 -0600

    APItest/t: Change some variable names
    
    One of these is used in multiple test files in this directory.
    
    The names are ambiguous for the contexts they occur in.  'first' can
    mean earliest in the string, but here it means the lowest ordinal value.

M       ext/XS-APItest/t/utf8.t
M       ext/XS-APItest/t/utf8_setup.pl
M       ext/XS-APItest/t/utf8_warn_base.pl

commit 1aff400115e5a6608b5a8c5232553e980d8eb79f
Author: Karl Williamson <[email protected]>
Date:   Thu Jun 15 12:01:15 2017 -0600

    APItest/t/utf8_setup.pl: Make sure diagnostics are on separate lines
    
    This changes diagnostic output to guarantee each element of the array
    starts on a new line, for easier readability.  The array may or may not
    already have terminating \n characters in the elements.

M       ext/XS-APItest/t/utf8_setup.pl

commit e86447a3e3d494d258a5d026cd89316d71397291
Author: Karl Williamson <[email protected]>
Date:   Mon May 29 20:58:32 2017 -0600

    APItest/t/utf8_setup.pl: Split function into two
    
    This function outputs a byte string as hex bytes.  A future commit will
    want that output without surrounding quotes, so create a version that
    doesn't have them.

M       ext/XS-APItest/t/utf8_setup.pl

commit 0618886662c89a3181a9de87ed3393ce45d973a2
Author: Karl Williamson <[email protected]>
Date:   Mon Jun 26 22:08:01 2017 -0600

    utf8n_to_uvchr(): Avoid some work
    
    By adding a single mask, we can avoid some unnecessary work, as that
    work is not necessary if just the one bit is set.

M       utf8.c

commit 56576a04ac0af02f0784b5519d4384ebcc27111d
Author: Karl Williamson <[email protected]>
Date:   Fri Jun 30 12:37:15 2017 -0600

    utf8.c: Comments, white-space only

M       utf8.c

commit c94c2f3967a7ba7fb43dc79d43c85dadd92f623b
Author: Karl Williamson <[email protected]>
Date:   Fri Jun 30 12:35:53 2017 -0600

    utf8.c: Consolidate duplicated string constants
    
    This reduces maintenance costs if they have to be updated.

M       utf8.c

commit abc28b5472dd6994953b2b1d8670f075ef541f71
Author: Karl Williamson <[email protected]>
Date:   Tue May 9 20:16:13 2017 -0600

    utf8.c: Don't calc code point from overflowing UTF8
    
    This avoids calculating a code point from UTF-8 that is known to
    overflow.  This could give incorrect results (used only in warning
    messages), but is done only when there are 3 (or more) malformations:
    overflow, overlong, UTF-8 terminated early, so it's unlikely to actually
    happen in the field.
    
    I am not adding any tests, as I don't know of any existing failures, and
    soon there will be a commit that limits code points to be at most
    IV_MAX.  That commit will cause cause existing tests to fail without
    this fix, so that is good enough to test it.  I imagine a brute force
    generator of UTF-8 would find some string that showed this problem up,
    absent the other coming changes, but it's not worth it.

M       utf8.c

commit f70e3f269d1ce428565a941ad4385fdb662a49b7
Author: Karl Williamson <[email protected]>
Date:   Mon Jul 3 18:59:50 2017 -0600

    t/uni/parser.t: Skip some tests on 32-bit platforms
    
    These tests require code points that are too large for 32-bit platforms,
    so skip there.

M       t/uni/parser.t

commit 51316de8a00907e00a42d73870c3940e2b4a49f6
Author: Karl Williamson <[email protected]>
Date:   Tue May 9 20:27:40 2017 -0600

    Move test from t/opbasic to t/uni
    
    This test is really not very basic, so it doesn't belong in opbasic.  It
    is for having a string delimiter be a very large code point, well above
    the legal strict Unicode max.  The code point is 2**32 - 1, which is
    UV_MAX on 32-bit platforms.
    
    Use of UV_MAX for a delimiter is about to become illegal, and so this
    test needs to be skipped on these.  Since this is compile time, there
    are a few complications in getting the script to compile on such
    systems, even though it is skipped at run time.
    
    The opbasic test file is so basic that it doesn't even use t/test.pl,
    whereas the one in t/uni does use that, and that has better
    infrastructure for handling this issue, including getting it to work on
    EBCDIC platforms.

M       t/opbasic/qq.t
M       t/uni/parser.t

commit 10463ce6abea3eb975addcdf1480ac8b1bcf9769
Author: Karl Williamson <[email protected]>
Date:   Mon Jul 3 11:30:52 2017 -0600

    t/comp/parser.t: Skip test on 32-bit builds
    
    This code point is no longer legal on such builds.  We need to use this
    code point to trigger the bug, so can't lower it to work on those
    builds.

M       t/comp/parser.t

commit 713375f8fb3fe893172750a56079759d3a7f8e83
Author: Karl Williamson <[email protected]>
Date:   Mon Jul 3 13:52:31 2017 -0600

    t/op/index.t: Skip now illegal code points on 32 bit builds
    
    These tests use code points that are now illegal on 32-bit platforms, so
    skip them there.  The failures these tests were added for did not happen
    except on these now-illegal code points.

M       t/op/index.t

commit 5139efdda3fe1fa1c39a7c4d3f8d48f71debf1f0
Author: Karl Williamson <[email protected]>
Date:   Mon Jul 3 09:33:09 2017 -0600

    t/op/chop.t: Don't use too large code points
    
    The bug this was testing for requires a code point that will no longer
    be legal on 32-bit machines.  So skip unless we're on a 64-bit platform,
    and revise to use chr() in the skipped code instead of "\x{}".  The
    latter would try to be compiled even if execution gets skipped, so would
    cause it to die, whereas chr() is runtime, so get skipped if
    inappropriate.  This also tested the very highest legal code point on
    64-bit machines, which is now illegal, so test the new very highest one.

M       t/op/chop.t

commit 56a85032922b2ccda478b54ad82410492b69d74c
Author: Karl Williamson <[email protected]>
Date:   Sun Jul 2 10:34:12 2017 -0600

    t/re/pat_advanced.t: Revise some tests
    
    These tests used the highest available code points, but those will soon
    be made illegal.  The tests don't need to be for these particular code
    points, but there do need to be tests of user-defined properties of high
    code points, so this commit changes to use the highest ones that will be
    legal after that change.

M       t/re/pat_advanced.t

commit 1ddd0324d8c362b22a8000c5d0c115397a556f59
Author: Karl Williamson <[email protected]>
Date:   Mon Jul 3 13:46:42 2017 -0600

    Restore a portion of reverted commits
    
    See the previous commit for details.

M       t/lib/warnings/utf8
M       t/op/ver.t

commit 76513bdc5d9e7bddc7d5da43b64755a51aea8673
Author: Karl Williamson <[email protected]>
Date:   Mon Jul 3 12:26:34 2017 -0600

    Revert: Restrict code points to <= IV_MAX
    
    This reverts the two related commits
    51099b64db323d0e1d871837f619d72bea8ca2f9  (partially)
    13f4dd346e6f3b61534a20f246de3a80b3feb743  (entirely)
    
    I was in the middle of a long branch dealing with this and related
    issues when these were pushed to blead.  It was far easier for me to
    revert these at the beginning of my branch than to try to rebase
    unreverted.  And there are changes needed to the approaches taken in the
    reverted commits.  A third related commit,
    113b8661ce6d987db4dd217e2f90cbb983ce5d00, doesn't cause problems so
    isn't reverted.
    
    I reverted the second commit, then the first one, and squashed them
    together into this one.  No other changes were done in this commit.
    The reason for the squashing is to avoid problems when bisecting on a
    32-bit machine.  If the bisect landed between the commits, it could show
    failures.  The portion of the first commit that wasn't reverted was the
    part that was rendered moot because of the changes in the meantime that
    forbid bitwise operations on strings containing code points above
    Latin1.
    
    The next commit in this series will reinstate portions of these commits.
    I reverted as much as possible here to make this reversion commit
    cleaner.
    
    The biggest problem with these commits, is that some Perl applications
    are made vulnerable to Denial of Service attacks.  I do believe it is ok
    to croak when a program tries, for example, to do chr() of too large a
    number, which is what the reverted commit does (and what this branch
    will eventually reinstate doing).  But when parsing UTF-8, you can't
    just die if you find something too large.  That would be an easy DOS on
    any program, such as a web server, that gets its UTF-8 from the public.
    Perl already has a means to deal with too-large code points (before
    5.26, this was those code points that overflow the word size), and web
    servers should have already been written in such a way as to deal with
    these.  This branch just adapts the code so that anything above IV_MAX
    is considered to be overflowing.  Web servers should not have to change
    as a result.
    
    A second issue is that one of the reasons we did the original
    deprecation is so that we can use the forbidden code points internally
    ourselves, such as Perl 6 does to store Grapheme Normal Form.  The
    implementation should not burn bridges, but allow that use to easily
    happen when the time comes.  For that reason, some tests should not be
    deleted, but commented out, so they can be quickly adapted.
    
    While working on this branch, I found several unlikely-to-occur bugs in
    the existing code.  These should be fixed now in the code that handles
    up to UV_MAX code points, so that when we do allow internal use of such,
    the bugs are already gone.
    
    I also had researched the tests that fail as a result of the IV_MAX
    restriction.  Some of the test changes in these reverted commits were
    inappropriate.
    
    For example, some tests that got changed were for bugs that happen only
    on code points that are now illegal on 32-bit builds.  Lowering the code
    point in the test to a legal value, as was done in some instances,  no
    longer tests for the original bug.  Instead, where I found this, I just
    skip the test on 32-bit platforms.
    
    Other tests were simply deleted, where a lower code point would have
    worked, and the test is useful with a lower code point.  I retain such
    tests, using a lower code point.  In some cases, it was probably ok to
    delete the tests on 32-bit platforms, as something was retained for a
    64-bit one, but since I had already done the adaptive work, I retain
    that.
    
    And still other tests were from files that I extensively revamp, so I
    went with the revamp.
    
    The following few commits fix those as far as possible now.  This is so
    that the reversion of the tests and my changes are close together in the
    final commit series.  Some changes have to wait to later, as for those
    where the entire test files are revamped, or when the deprecation
    messages finally go away in the final commit of this series.
    
    In cases where the message wording I was contemplating using conflicts
    with the reverted commits, I change mine to use that of the reverted
    commits.

M       ext/XS-APItest/t/utf8.t
M       ext/XS-APItest/t/utf8_warn_base.pl
M       t/comp/parser.t
M       t/lib/warnings/utf8
M       t/op/chop.t
M       t/op/index.t
M       t/op/ver.t
M       t/opbasic/qq.t
M       t/re/pat_advanced.t
M       t/uni/parser.t
M       utf8.c
-----------------------------------------------------------------------

Summary of changes:
 MANIFEST                           |    1 -
 embed.fnc                          |   15 +-
 embed.h                            |    6 +-
 ext/XS-APItest/t/utf8.t            |  152 +--
 ext/XS-APItest/t/utf8_malformed.t  |  418 ------
 ext/XS-APItest/t/utf8_setup.pl     |   30 +-
 ext/XS-APItest/t/utf8_warn_base.pl | 2505 +++++++++++++++++++++++-------------
 inline.h                           |   14 +-
 proto.h                            |   22 +-
 t/comp/parser.t                    |   17 +-
 t/lib/warnings/utf8                |   24 +-
 t/op/chop.t                        |   24 +-
 t/op/index.t                       |   13 +-
 t/op/utf8decode.t                  |   20 +-
 t/op/ver.t                         |    1 -
 t/re/pat_advanced.t                |   41 +-
 t/uni/parser.t                     |   22 +-
 utf8.c                             | 1174 ++++++++++-------
 utf8.h                             |   64 +-
 utfebcdic.h                        |    2 +
 20 files changed, 2568 insertions(+), 1997 deletions(-)
 delete mode 100644 ext/XS-APItest/t/utf8_malformed.t

diff --git a/MANIFEST b/MANIFEST
index 38b908e90d..8d2e36258c 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -4321,7 +4321,6 @@ ext/XS-APItest/t/temp_lv_sub.t    XS::APItest: tests for 
lvalue subs returning temp
 ext/XS-APItest/t/underscore_length.t   Test find_rundefsv()
 ext/XS-APItest/t/utf16_to_utf8.t       Test behaviour of 
utf16_to_utf8{,reversed}
 ext/XS-APItest/t/utf8.t                Tests for code in utf8.c
-ext/XS-APItest/t/utf8_malformed.t      Tests for code in utf8.c
 ext/XS-APItest/t/utf8_setup.pl Tests for code in utf8.c
 ext/XS-APItest/t/utf8_warn00.t Tests for code in utf8.c
 ext/XS-APItest/t/utf8_warn01.t Tests for code in utf8.c
diff --git a/embed.fnc b/embed.fnc
index 1e661e4f3f..7c6710a4fd 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -744,7 +744,11 @@ ADMpR      |bool   |isALNUM_lazy   |NN const char* p
 #ifdef PERL_IN_UTF8_C
 snR    |U8     |to_lower_latin1|const U8 c|NULLOK U8 *p|NULLOK STRLEN *lenp  \
                |const char dummy
-inR    |bool   |is_utf8_cp_above_31_bits|NN const U8 * const s|NN const U8 * 
const e
+#  ifndef UV_IS_QUAD
+snR    |int    |is_utf8_cp_above_31_bits|NN const U8 * const s             \
+                                        |NN const U8 * const e             \
+                                        |const bool consider_overlongs
+#  endif
 #endif
 #if defined(PERL_IN_UTF8_C) || defined(PERL_IN_REGCOMP_C) || 
defined(PERL_IN_REGEXEC_C)
 EXp    |UV        |_to_fold_latin1|const U8 c|NN U8 *p|NN STRLEN *lenp|const 
unsigned int flags
@@ -1720,9 +1724,12 @@ EpM      |char * |_byte_dump_string                      
                \
                                |const STRLEN len                       \
                                |const bool format
 #if defined(PERL_IN_UTF8_C)
-inR    |bool   |does_utf8_overflow|NN const U8 * const s|NN const U8 * e
-inR    |bool   |is_utf8_overlong_given_start_byte_ok|NN const U8 * const 
s|const STRLEN len
-inR    |bool   |isFF_OVERLONG  |NN const U8 * const s|const STRLEN len
+inR    |int    |does_utf8_overflow|NN const U8 * const s               \
+                                  |NN const U8 * e                     \
+                                  |const bool consider_overlongs
+inR    |int    |is_utf8_overlong_given_start_byte_ok|NN const U8 * const s \
+                                                    |const STRLEN len
+inR    |int    |isFF_OVERLONG  |NN const U8 * const s|const STRLEN len
 sMR    |char * |unexpected_non_continuation_text                       \
                |NN const U8 * const s                                  \
                |STRLEN print_len                                       \
diff --git a/embed.h b/embed.h
index a74458d7fd..608d252e54 100644
--- a/embed.h
+++ b/embed.h
@@ -1472,6 +1472,11 @@
 #define mulexp10               S_mulexp10
 #    endif
 #  endif
+#  if !defined(UV_IS_QUAD)
+#    if defined(PERL_IN_UTF8_C)
+#define is_utf8_cp_above_31_bits       S_is_utf8_cp_above_31_bits
+#    endif
+#  endif
 #  if !defined(WIN32)
 #define do_exec3(a,b,c)                Perl_do_exec3(aTHX_ a,b,c)
 #  endif
@@ -1844,7 +1849,6 @@
 #define isFF_OVERLONG          S_isFF_OVERLONG
 #define is_utf8_common(a,b,c,d)        S_is_utf8_common(aTHX_ a,b,c,d)
 #define is_utf8_common_with_len(a,b,c,d,e)     S_is_utf8_common_with_len(aTHX_ 
a,b,c,d,e)
-#define is_utf8_cp_above_31_bits       S_is_utf8_cp_above_31_bits
 #define is_utf8_overlong_given_start_byte_ok   
S_is_utf8_overlong_given_start_byte_ok
 #define swash_scan_list_line(a,b,c,d,e,f,g)    S_swash_scan_list_line(aTHX_ 
a,b,c,d,e,f,g)
 #define swatch_get(a,b,c)      S_swatch_get(aTHX_ a,b,c)
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index 95e2628f57..37c65aa8e1 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -13,9 +13,6 @@ BEGIN {
 
 $|=1;
 
-no warnings 'deprecated'; # Some of the below are above IV_MAX on 32 bit
-                          # machines, and that is tested elsewhere
-
 use XS::APItest;
 
 my $pound_sign = chr utf8::unicode_to_native(163);
@@ -32,8 +29,8 @@ my $look_for_everything_utf8n_to
                        | $::UTF8_WARN_NONCHAR
                        | $::UTF8_DISALLOW_SUPER
                        | $::UTF8_WARN_SUPER
-                       | $::UTF8_DISALLOW_ABOVE_31_BIT
-                       | $::UTF8_WARN_ABOVE_31_BIT;
+                       | $::UTF8_DISALLOW_PERL_EXTENDED
+                       | $::UTF8_WARN_PERL_EXTENDED;
 my $look_for_everything_uvchr_to
                         = $::UNICODE_DISALLOW_SURROGATE
                        | $::UNICODE_WARN_SURROGATE
@@ -41,8 +38,8 @@ my $look_for_everything_uvchr_to
                        | $::UNICODE_WARN_NONCHAR
                        | $::UNICODE_DISALLOW_SUPER
                        | $::UNICODE_WARN_SUPER
-                       | $::UNICODE_DISALLOW_ABOVE_31_BIT
-                       | $::UNICODE_WARN_ABOVE_31_BIT;
+                       | $::UNICODE_DISALLOW_PERL_EXTENDED
+                       | $::UNICODE_WARN_PERL_EXTENDED;
 
 foreach ([0, '', '', 'empty'],
         [0, 'N', 'N', '1 char'],
@@ -72,9 +69,9 @@ foreach ([0, '', '', 'empty'],
 # are adjacent to problematic code points, so we want to make sure they aren't
 # considered problematic.
 my %code_points = (
-    0x0100     => (isASCII) ? "\xc4\x80" : I8_to_native("\xc8\xa0"),
-    0x0400 - 1 => (isASCII) ? "\xcf\xbf" : I8_to_native("\xdf\xbf"),
-    0x0400     => (isASCII) ? "\xd0\x80" : I8_to_native("\xe1\xa0\xa0"),
+    0x0100     => (isASCII) ? "\xc4\x80"     : I8_to_native("\xc8\xa0"),
+    0x0400 - 1 => (isASCII) ? "\xcf\xbf"     : I8_to_native("\xdf\xbf"),
+    0x0400     => (isASCII) ? "\xd0\x80"     : I8_to_native("\xe1\xa0\xa0"),
     0x0800 - 1 => (isASCII) ? "\xdf\xbf"     : I8_to_native("\xe1\xbf\xbf"),
     0x0800     => (isASCII) ? "\xe0\xa0\x80" : I8_to_native("\xe2\xa0\xa0"),
     0x4000 - 1 => (isASCII) ? "\xe3\xbf\xbf" : I8_to_native("\xef\xbf\xbf"),
@@ -95,11 +92,10 @@ my %code_points = (
     0xD7FF     => (isASCII) ? "\xed\x9f\xbf" : 
I8_to_native("\xf1\xb5\xbf\xbf"),
     0xD800     => (isASCII) ? "\xed\xa0\x80" : 
I8_to_native("\xf1\xb6\xa0\xa0"),
     0xDC00     => (isASCII) ? "\xed\xb0\x80" : 
I8_to_native("\xf1\xb7\xa0\xa0"),
-    0xDFFF     => (isASCII) ? "\xee\x80\x80" : 
I8_to_native("\xf1\xb8\xa0\xa0"),
     0xDFFF     => (isASCII) ? "\xed\xbf\xbf" : 
I8_to_native("\xf1\xb7\xbf\xbf"),
     0xE000     => (isASCII) ? "\xee\x80\x80" : 
I8_to_native("\xf1\xb8\xa0\xa0"),
 
-    # Include the 32 contiguous non characters, and surrounding code points
+    # Include the 32 contiguous non characters, and adjacent code points
     0xFDCF     => (isASCII) ? "\xef\xb7\x8f" : 
I8_to_native("\xf1\xbf\xae\xaf"),
     0xFDD0     => (isASCII) ? "\xef\xb7\x90" : 
I8_to_native("\xf1\xbf\xae\xb0"),
     0xFDD1     => (isASCII) ? "\xef\xb7\x91" : 
I8_to_native("\xf1\xbf\xae\xb1"),
@@ -381,23 +377,21 @@ my %code_points = (
     0x40000000     =>
     (isASCII) ?    "\xfd\x80\x80\x80\x80\x80"
     : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0"),
+    0x80000000 - 1 =>
+    (isASCII) ?    "\xfd\xbf\xbf\xbf\xbf\xbf"
+    : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa1\xbf\xbf\xbf\xbf\xbf\xbf"),
 );
 
 if ($::is64bit) {
     no warnings qw(overflow portable);
-
-    $code_points{0x80000000 - 1}
-     = (isASCII)
-     ?    "\xfd\xbf\xbf\xbf\xbf\xbf"
-     : 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa1\xbf\xbf\xbf\xbf\xbf\xbf"),
     $code_points{0x80000000}
-     = (isASCII)
-     ?    "\xfe\x82\x80\x80\x80\x80\x80"
-     : 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0"),
+    = (isASCII)
+    ?              "\xfe\x82\x80\x80\x80\x80\x80"
+    : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0");
     $code_points{0xFFFFFFFF}
-     = (isASCII)
-     ?    "\xfe\x83\xbf\xbf\xbf\xbf\xbf"
-     : 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa3\xbf\xbf\xbf\xbf\xbf\xbf"),
+    = (isASCII)
+    ?              "\xfe\x83\xbf\xbf\xbf\xbf\xbf"
+    : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa3\xbf\xbf\xbf\xbf\xbf\xbf");
     $code_points{0x100000000}
      = (isASCII)
      ?              "\xfe\x84\x80\x80\x80\x80\x80"
@@ -410,6 +404,16 @@ if ($::is64bit) {
      = (isASCII)
      ?              "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80"
      : 
I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0\xa0");
+    $code_points{0x7FFFFFFFFFFFFFFF}
+     = (isASCII)
+     ?              "\xff\x80\x87\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"
+     : 
I8_to_native("\xff\xa7\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf");
+
+    # This is used when UV_MAX is the upper limit of acceptable code points
+    # $code_points{0xFFFFFFFFFFFFFFFF}
+    # = (isASCII)
+    # ?              "\xff\x80\x8f\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"
+    # : 
I8_to_native("\xff\xaf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf");
 
     if (isASCII) {  # These could falsely show as overlongs in a naive
                     # implementation
@@ -450,17 +454,17 @@ while ($cp < ((isASCII) ? 128 : 160)) {   # This is from 
the definition of
 # continuation bytes can be in, and what the lowest start byte can be.  So we
 # cycle through them.
 
-my $final_continuation = 0xBF;
+my $highest_continuation = 0xBF;
 my $start = (isASCII) ? 0xC2 : 0xC5;
 
-my $continuation = $::first_continuation - 1;
+my $continuation = $::lowest_continuation - 1;
 
 while ($cp < 255) {
-    if (++$continuation > $final_continuation) {
+    if (++$continuation > $highest_continuation) {
 
         # Wrap to the next start byte when we reach the final continuation
         # byte possible
-        $continuation = $::first_continuation;
+        $continuation = $::lowest_continuation;
         $start++;
     }
     $code_points{$cp} = I8_to_native(chr($start) . chr($continuation));
@@ -475,11 +479,6 @@ local $SIG{__WARN__} = sub { push @warnings, @_ };
 
 my %restriction_types;
 
-$restriction_types{""}{'valid_strings'} = "";
-$restriction_types{"c9strict"}{'valid_strings'} = "";
-$restriction_types{"strict"}{'valid_strings'} = "";
-$restriction_types{"fits_in_31_bits"}{'valid_strings'} = "";
-
 # This set of tests looks for basic sanity, and lastly tests various routines
 # for the given code point.  If the earlier tests for that code point fail,
 # the later ones probably will too.  Malformations are tested in later
@@ -621,15 +620,15 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
 
     my $valid_under_strict = 1;
     my $valid_under_c9strict = 1;
-    my $valid_for_fits_in_31_bits = 1;
+    my $valid_for_not_extended_utf8 = 1;
     if ($n > 0x10FFFF) {
         $this_utf8_flags &= ~($::UTF8_DISALLOW_SUPER|$::UTF8_WARN_SUPER);
         $valid_under_strict = 0;
         $valid_under_c9strict = 0;
         if ($n > 2 ** 31 - 1) {
             $this_utf8_flags &=
-                        
~($::UTF8_DISALLOW_ABOVE_31_BIT|$::UTF8_WARN_ABOVE_31_BIT);
-            $valid_for_fits_in_31_bits = 0;
+                ~($::UTF8_DISALLOW_PERL_EXTENDED|$::UTF8_WARN_PERL_EXTENDED);
+            $valid_for_not_extended_utf8 = 0;
         }
     }
     elsif (($n >= 0xFDD0 && $n <= 0xFDEF) || ($n & 0xFFFE) == 0xFFFE) {
@@ -785,17 +784,18 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
     my $this_uvchr_flags = $look_for_everything_uvchr_to;
     if ($n > 2 ** 31 - 1) {
         $this_uvchr_flags &=
-                
~($::UNICODE_DISALLOW_ABOVE_31_BIT|$::UNICODE_WARN_ABOVE_31_BIT);
+            ~($::UNICODE_DISALLOW_PERL_EXTENDED|$::UNICODE_WARN_PERL_EXTENDED);
     }
     if ($n > 0x10FFFF) {
         $this_uvchr_flags &= 
~($::UNICODE_DISALLOW_SUPER|$::UNICODE_WARN_SUPER);
     }
     elsif (($n >= 0xFDD0 && $n <= 0xFDEF) || ($n & 0xFFFE) == 0xFFFE) {
-        $this_uvchr_flags &= 
~($::UNICODE_DISALLOW_NONCHAR|$::UNICODE_WARN_NONCHAR);
+        $this_uvchr_flags
+                     &= ~($::UNICODE_DISALLOW_NONCHAR|$::UNICODE_WARN_NONCHAR);
     }
     elsif ($n >= 0xD800 && $n <= 0xDFFF) {
         $this_uvchr_flags
-                     &= 
~($::UNICODE_DISALLOW_SURROGATE|$::UNICODE_WARN_SURROGATE);
+                &= ~($::UNICODE_DISALLOW_SURROGATE|$::UNICODE_WARN_SURROGATE);
     }
     $display_flags = sprintf "0x%x", $this_uvchr_flags;
 
@@ -845,17 +845,17 @@ for my $u (sort { utf8::unicode_to_native($a) <=> 
utf8::unicode_to_native($b) }
                                 = $restriction_types{"strict"}{'valid_counts'};
     }
 
-    if ($valid_for_fits_in_31_bits) {
-        $restriction_types{"fits_in_31_bits"}{'valid_strings'} .= $bytes;
-        $restriction_types{"fits_in_31_bits"}{'valid_counts'}++;
+    if ($valid_for_not_extended_utf8) {
+        $restriction_types{"not_extended_utf8"}{'valid_strings'} .= $bytes;
+        $restriction_types{"not_extended_utf8"}{'valid_counts'}++;
     }
     elsif (! exists
-                $restriction_types{"fits_in_31_bits"}{'first_invalid_offset'})
+                
$restriction_types{"not_extended_utf8"}{'first_invalid_offset'})
     {
-        $restriction_types{"fits_in_31_bits"}{'first_invalid_offset'}
-                = length 
$restriction_types{"fits_in_31_bits"}{'valid_strings'};
-        $restriction_types{"fits_in_31_bits"}{'first_invalid_count'}
-                        = 
$restriction_types{"fits_in_31_bits"}{'valid_counts'};
+        $restriction_types{"not_extended_utf8"}{'first_invalid_offset'}
+                = length 
$restriction_types{"not_extended_utf8"}{'valid_strings'};
+        $restriction_types{"not_extended_utf8"}{'first_invalid_count'}
+                        = 
$restriction_types{"not_extended_utf8"}{'valid_counts'};
     }
 }
 
@@ -869,16 +869,13 @@ my $p = (isASCII) ? "\xe1\x80" : 
I8_to_native("\xE4\xA0");  # partial
 for my $restriction (sort keys %restriction_types) {
     use bytes;
 
-    next if $restriction eq 'fits_in_31_bits'
-         && !defined 
$restriction_types{"fits_in_31_bits"}{'first_invalid_offset'};
-
     for my $use_flags ("", "_flags") {
 
         # For each restriction, we test it in both the is_foo_flags functions
         # and the specially named foo function.  But not if there isn't such a
         # specially named function.  Currently, this is the only tested
         # restriction that doesn't have a specially named function
-        next if $use_flags eq "" && $restriction eq "fits_in_31_bits";
+        next if $use_flags eq "" && $restriction eq "not_extended_utf8";
 
         # Start building up the name of the function we will test.
         my $base_name = "is_";
@@ -890,7 +887,7 @@ for my $restriction (sort keys %restriction_types) {
         # We test both "is_utf8_string_foo" and "is_fixed_width_buf" functions
         foreach my $operand ('string', 'fixed_width_buf') {
 
-            # Currently, the only fixed_width_buf functions have the '_flags'
+            # Currently, only fixed_width_buf functions have the '_flags'
             # suffix.
             next if $operand eq 'fixed_width_buf' && $use_flags eq "";
 
@@ -905,7 +902,7 @@ for my $restriction (sort keys %restriction_types) {
                 #      continuation character to the valid string
                 #   c) input created by appending a partial character.  This
                 #      is valid in the 'fixed_width' functions, but invalid in
-                #   the 'string' ones
+                #      the 'string' ones
                 #   d) invalid input created by calling a function that is
                 #      expecting a restricted form of the input using the 
string
                 #      that's valid when unrestricted
@@ -949,42 +946,29 @@ for my $restriction (sort keys %restriction_types) {
                                         = 0 if $operand eq "fixed_width_buf";
                             }
                         }
+                        elsif (! exists $restriction_types
+                                    {$this_error_type}{'first_invalid_count'})
+                        {
+                            # If no errors were found, this is entirely valid.
+                            $this_error_type = 0;
+                        }
                         else {
-                            $test_name_suffix
-                                        = " if contains forbidden code points";
-                            if ($this_error_type eq "c9strict") {
-                                $bytes = 
$restriction_types{""}{'valid_strings'};
-                                $expected_offset
-                                 = $restriction_types{"c9strict"}
-                                                     {'first_invalid_offset'};
-                                $expected_count
-                                  = $restriction_types{"c9strict"}
-                                                      {'first_invalid_count'};
-                            }
-                            elsif ($this_error_type eq "strict") {
-                                $bytes = 
$restriction_types{""}{'valid_strings'};
-                                $expected_offset
-                                  = $restriction_types{"strict"}
-                                                      {'first_invalid_offset'};
-                                $expected_count
-                                  = $restriction_types{"strict"}
-                                                      {'first_invalid_count'};
 
-                            }
-                            elsif ($this_error_type eq "fits_in_31_bits") {
-                                $bytes = 
$restriction_types{""}{'valid_strings'};
-                                $expected_offset
-                                  = $restriction_types{"fits_in_31_bits"}
-                                                      {'first_invalid_offset'};
-                                $expected_count
-                                    = $restriction_types{"fits_in_31_bits"}
-                                                        
{'first_invalid_count'};
-                            }
-                            else {
+                            if (! exists $restriction_types{$this_error_type}) 
{
                                 fail("Internal test error: Unknown error type "
                                 . "'$this_error_type'");
                                 next;
                             }
+                            $test_name_suffix
+                                        = " if contains forbidden code points";
+
+                            $bytes = $restriction_types{""}{'valid_strings'};
+                            $expected_offset
+                                 = $restriction_types{$this_error_type}
+                                                     {'first_invalid_offset'};
+                            $expected_count
+                                  = $restriction_types{$this_error_type }
+                                                      {'first_invalid_count'};
                         }
                     }
 
@@ -1011,8 +995,8 @@ for my $restriction (sort keys %restriction_types) {
                             elsif ($restriction eq "strict") {
                                 $test .= ", 
$::UTF8_DISALLOW_ILLEGAL_INTERCHANGE";
                             }
-                            elsif ($restriction eq "fits_in_31_bits") {
-                                $test .= ", $::UTF8_DISALLOW_ABOVE_31_BIT";
+                            elsif ($restriction eq "not_extended_utf8") {
+                                $test .= ", $::UTF8_DISALLOW_PERL_EXTENDED";
                             }
                             else {
                                 fail("Internal test error: Unknown restriction 
"
diff --git a/ext/XS-APItest/t/utf8_malformed.t 
b/ext/XS-APItest/t/utf8_malformed.t
deleted file mode 100644
index 16c5b7f437..0000000000
--- a/ext/XS-APItest/t/utf8_malformed.t
+++ /dev/null
@@ -1,418 +0,0 @@
-#!perl -w
-
-# Test handling of various UTF-8 malformations
-
-use strict;
-use Test::More;
-
-BEGIN {
-    use_ok('XS::APItest');
-    require 'charset_tools.pl';
-    require './t/utf8_setup.pl';
-};
-
-$|=1;
-
-no warnings 'deprecated'; # Some of the below are above IV_MAX on 32 bit
-                          # machines, and that is tested elsewhere
-
-use XS::APItest;
-
-my @warnings;
-
-use warnings 'utf8';
-local $SIG{__WARN__} = sub { push @warnings, @_ };
-
-my $I8c = $::I8c;
-
-my $REPLACEMENT = 0xFFFD;
-
-# Now test the malformations.  All these raise category utf8 warnings.
-my @malformations = (
-    # ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
-    #  $allowed_uv, $expected_len, $needed_to_discern_len, $message )
-
-# Now considered a program bug, and asserted against
-    #[ "zero length string malformation", "", 0,
-    #    $::UTF8_ALLOW_EMPTY, $::UTF8_GOT_EMPTY, $REPLACEMENT, 0, 0,
-    #    qr/empty string/
-    #],
-    [ "orphan continuation byte malformation", I8_to_native("${I8c}a"), 2,
-        $::UTF8_ALLOW_CONTINUATION, $::UTF8_GOT_CONTINUATION, $REPLACEMENT,
-        1, 1,
-        qr/unexpected continuation byte/
-    ],
-    [ "premature next character malformation (immediate)",
-        (isASCII) ? "\xc2\xc2\x80" : I8_to_native("\xc5\xc5\xa0"),
-        3,
-        $::UTF8_ALLOW_NON_CONTINUATION, $::UTF8_GOT_NON_CONTINUATION, 
$REPLACEMENT,
-        1, 2,
-        qr/unexpected non-continuation byte.*immediately after start byte/
-    ],
-    [ "premature next character malformation (non-immediate)",
-        I8_to_native("\xef${I8c}a"), 3,
-        $::UTF8_ALLOW_NON_CONTINUATION, $::UTF8_GOT_NON_CONTINUATION, 
$REPLACEMENT,
-        2, 3,
-        qr/unexpected non-continuation byte .* 2 bytes after start byte/
-    ],
-    [ "too short malformation", I8_to_native("\xf1${I8c}a"), 2,
-        # Having the 'a' after this, but saying there are only 2 bytes also
-        # tests that we pay attention to the passed in length
-        $::UTF8_ALLOW_SHORT, $::UTF8_GOT_SHORT, $REPLACEMENT,
-        2, 2,
-        qr/2 bytes available, need 4/
-    ],
-    [ "overlong malformation, lowest 2-byte",
-        (isASCII) ? "\xc0\x80" : I8_to_native("\xc0\xa0"),
-        2,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        0,   # NUL
-        2, 1,
-        qr/overlong/
-    ],
-    [ "overlong malformation, highest 2-byte",
-        (isASCII) ? "\xc1\xbf" : I8_to_native("\xc4\xbf"),
-        2,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        (isASCII) ? 0x7F : utf8::unicode_to_native(0x9F),
-        2, 1,
-        qr/overlong/
-    ],
-    [ "overlong malformation, lowest 3-byte",
-        (isASCII) ? "\xe0\x80\x80" : I8_to_native("\xe0\xa0\xa0"),
-        3,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        0,   # NUL
-        3, (isASCII) ? 2 : 1,
-        qr/overlong/
-    ],
-    [ "overlong malformation, highest 3-byte",
-        (isASCII) ? "\xe0\x9f\xbf" : I8_to_native("\xe0\xbf\xbf"),
-        3,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        (isASCII) ? 0x7FF : 0x3FF,
-        3, (isASCII) ? 2 : 1,
-        qr/overlong/
-    ],
-    [ "overlong malformation, lowest 4-byte",
-        (isASCII) ? "\xf0\x80\x80\x80" : I8_to_native("\xf0\xa0\xa0\xa0"),
-        4,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        0,   # NUL
-        4, 2,
-        qr/overlong/
-    ],
-    [ "overlong malformation, highest 4-byte",
-        (isASCII) ? "\xf0\x8F\xbf\xbf" : I8_to_native("\xf0\xaf\xbf\xbf"),
-        4,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        (isASCII) ? 0xFFFF : 0x3FFF,
-        4, 2,
-        qr/overlong/
-    ],
-    [ "overlong malformation, lowest 5-byte",
-        (isASCII)
-         ?              "\xf8\x80\x80\x80\x80"
-         : I8_to_native("\xf8\xa0\xa0\xa0\xa0"),
-        5,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        0,   # NUL
-        5, 2,
-        qr/overlong/
-    ],
-    [ "overlong malformation, highest 5-byte",
-        (isASCII)
-         ?              "\xf8\x87\xbf\xbf\xbf"
-         : I8_to_native("\xf8\xa7\xbf\xbf\xbf"),
-        5,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        (isASCII) ? 0x1FFFFF : 0x3FFFF,
-        5, 2,
-        qr/overlong/
-    ],
-    [ "overlong malformation, lowest 6-byte",
-        (isASCII)
-         ?              "\xfc\x80\x80\x80\x80\x80"
-         : I8_to_native("\xfc\xa0\xa0\xa0\xa0\xa0"),
-        6,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        0,   # NUL
-        6, 2,
-        qr/overlong/
-    ],
-    [ "overlong malformation, highest 6-byte",
-        (isASCII)
-         ?              "\xfc\x83\xbf\xbf\xbf\xbf"
-         : I8_to_native("\xfc\xa3\xbf\xbf\xbf\xbf"),
-        6,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        (isASCII) ? 0x3FFFFFF : 0x3FFFFF,
-        6, 2,
-        qr/overlong/
-    ],
-    [ "overlong malformation, lowest 7-byte",
-        (isASCII)
-         ?              "\xfe\x80\x80\x80\x80\x80\x80"
-         : I8_to_native("\xfe\xa0\xa0\xa0\xa0\xa0\xa0"),
-        7,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        0,   # NUL
-        7, 2,
-        qr/overlong/
-    ],
-    [ "overlong malformation, highest 7-byte",
-        (isASCII)
-         ?              "\xfe\x81\xbf\xbf\xbf\xbf\xbf"
-         : I8_to_native("\xfe\xa1\xbf\xbf\xbf\xbf\xbf"),
-        7,
-        $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-        (isASCII) ? 0x7FFFFFFF : 0x3FFFFFF,
-        7, 2,
-        qr/overlong/
-    ],
-);
-
-if (isASCII && ! $::is64bit) {    # 32-bit ASCII platform
-    no warnings 'portable';
-    push @malformations,
-        [ "overflow malformation",
-            "\xfe\x84\x80\x80\x80\x80\x80",  # Represents 2**32
-            7,
-            $::UTF8_ALLOW_OVERFLOW, $::UTF8_GOT_OVERFLOW,
-            $REPLACEMENT,
-            7, 2,
-            qr/overflows/
-        ],
-        [ "overflow malformation",
-            "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80",
-            $::max_bytes,
-            $::UTF8_ALLOW_OVERFLOW, $::UTF8_GOT_OVERFLOW,
-            $REPLACEMENT,
-            $::max_bytes, 1,
-            qr/overflows/
-        ];
-}
-else { # 64-bit ASCII, or EBCDIC of any size.
-    # On EBCDIC platforms, another overlong test is needed even on 32-bit
-    # systems, whereas it doesn't happen on ASCII except on 64-bit ones.
-
-    no warnings 'portable';
-    no warnings 'overflow'; # Doesn't run on 32-bit systems, but compiles
-    push @malformations,
-        [ "overlong malformation, lowest max-byte",
-            (isASCII)
-             ?      "\xff\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80"
-             : I8_to_native(
-                    
"\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-            $::max_bytes,
-            $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-            0,   # NUL
-            $::max_bytes, (isASCII) ? 7 : 8,
-            qr/overlong/,
-        ],
-        [ "overlong malformation, highest max-byte",
-            (isASCII)    # 2**36-1 on ASCII; 2**30-1 on EBCDIC
-             ?      "\xff\x80\x80\x80\x80\x80\x80\xbf\xbf\xbf\xbf\xbf\xbf"
-             : I8_to_native(
-                    
"\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xbf\xbf\xbf\xbf\xbf\xbf"),
-            $::max_bytes,
-            $::UTF8_ALLOW_LONG, $::UTF8_GOT_LONG,
-            (isASCII) ? 0xFFFFFFFFF : 0x3FFFFFFF,
-            $::max_bytes, (isASCII) ? 7 : 8,
-            qr/overlong/,
-        ];
-
-    if (! $::is64bit) {   # 32-bit EBCDIC
-        push @malformations,
-        [ "overflow malformation",
-            I8_to_native(
-                    
"\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa4\xa0\xa0\xa0\xa0\xa0\xa0"),
-            $::max_bytes,
-            $::UTF8_ALLOW_OVERFLOW, $::UTF8_GOT_OVERFLOW,
-            $REPLACEMENT,
-            $::max_bytes, 8,
-            qr/overflows/
-        ];
-    }
-    else {  # 64-bit, either ASCII or EBCDIC
-        push @malformations,
-            [ "overflow malformation",
-               (isASCII)
-                ?   "\xff\x80\x90\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"
-                : I8_to_native(
-                    
"\xff\xb0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-                $::max_bytes,
-                $::UTF8_ALLOW_OVERFLOW, $::UTF8_GOT_OVERFLOW,
-                $REPLACEMENT,
-                $::max_bytes, (isASCII) ? 3 : 2,
-                qr/overflows/
-            ];
-    }
-}
-
-# For each overlong malformation in the list, we modify it, so that there are
-# two tests.  The first one returns the replacement character given the input
-# flags, and the second test adds a flag that causes the actual code point the
-# malformation represents to be returned.
-my @added_overlongs;
-foreach my $test (@malformations) {
-    my ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
-        $allowed_uv, $expected_len, $needed_to_discern_len, $message ) = 
@$test;
-    next unless $testname =~ /overlong/;
-
-    $test->[0] .= "; use REPLACEMENT CHAR";
-    $test->[5] = $REPLACEMENT;
-
-    push @added_overlongs,
-        [ $testname . "; use actual value",
-          $bytes, $length,
-          $allow_flags | $::UTF8_ALLOW_LONG_AND_ITS_VALUE,
-          $expected_error_flags, $allowed_uv, $expected_len,
-          $needed_to_discern_len, $message
-        ];
-}
-push @malformations, @added_overlongs;
-
-foreach my $test (@malformations) {
-    my ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
-        $allowed_uv, $expected_len, $needed_to_discern_len, $message ) = 
@$test;
-
-    if (length($bytes) < $length) {
-        fail("Internal test error: actual buffer length (" . length($bytes)
-           . ") must be at least as high as how far we are allowed to read"
-           . " into it ($length)");
-        diag($testname);
-        next;
-    }
-
-    undef @warnings;
-
-    my $ret = test_isUTF8_CHAR($bytes, $length);
-    is($ret, 0, "$testname: isUTF8_CHAR returns 0");
-    is(scalar @warnings, 0, "$testname: isUTF8_CHAR() generated no warnings")
-      or output_warnings(@warnings);
-
-    undef @warnings;
-
-    $ret = test_isUTF8_CHAR_flags($bytes, $length, 0);
-    is($ret, 0, "$testname: isUTF8_CHAR_flags returns 0");
-    is(scalar @warnings, 0, "$testname: isUTF8_CHAR_flags() generated no"
-                          . " warnings")
-      or output_warnings(@warnings);
-
-    $ret = test_isSTRICT_UTF8_CHAR($bytes, $length);
-    is($ret, 0, "$testname: isSTRICT_UTF8_CHAR returns 0");
-    is(scalar @warnings, 0,
-                    "$testname: isSTRICT_UTF8_CHAR() generated no warnings")
-      or output_warnings(@warnings);
-
-    $ret = test_isC9_STRICT_UTF8_CHAR($bytes, $length);
-    is($ret, 0, "$testname: isC9_STRICT_UTF8_CHAR returns 0");
-    is(scalar @warnings, 0,
-               "$testname: isC9_STRICT_UTF8_CHAR() generated no warnings")
-      or output_warnings(@warnings);
-
-    for my $j (1 .. $length - 1) {
-        my $partial = substr($bytes, 0, $j);
-
-        undef @warnings;
-
-        $ret = test_is_utf8_valid_partial_char_flags($bytes, $j, 0);
-
-        my $ret_should_be = 0;
-        my $comment = "";
-        if ($j < $needed_to_discern_len) {
-            $ret_should_be = 1;
-            $comment = ", but need $needed_to_discern_len bytes to discern:";
-        }
-
-        is($ret, $ret_should_be, "$testname: is_utf8_valid_partial_char_flags("
-                                . display_bytes($partial)
-                                . ")$comment returns $ret_should_be");
-        is(scalar @warnings, 0,
-                "$testname: is_utf8_valid_partial_char_flags() generated"
-              . " no warnings")
-          or output_warnings(@warnings);
-    }
-
-
-    # Test what happens when this malformation is not allowed
-    undef @warnings;
-    my $ret_ref = test_utf8n_to_uvchr_error($bytes, $length, 0);
-    is($ret_ref->[0], 0, "$testname: disallowed: Returns 0");
-    is($ret_ref->[1], $expected_len,
-       "$testname: utf8n_to_uvchr_error(), disallowed: Returns expected"
-     . " length: $expected_len");
-    if (is(scalar @warnings, 1,
-           "$testname: disallowed: Got a single warning "))
-    {
-        like($warnings[0], $message,
-             "$testname: disallowed: Got expected warning");
-    }
-    else {
-        if (scalar @warnings) {
-            output_warnings(@warnings);
-        }
-    }
-    is($ret_ref->[2], $expected_error_flags,
-       "$testname: utf8n_to_uvchr_error(), disallowed:"
-     . " Returns expected error");
-
-    {   # Next test when disallowed, and warnings are off.
-        undef @warnings;
-        no warnings 'utf8';
-        my $ret_ref = test_utf8n_to_uvchr_error($bytes, $length, 0);
-        is($ret_ref->[0], 0,
-           "$testname: utf8n_to_uvchr_error(), disallowed: no warnings 'utf8':"
-         . " Returns 0");
-        is($ret_ref->[1], $expected_len,
-           "$testname: utf8n_to_uvchr_error(), disallowed: no warnings 'utf8':"
-         . " Returns expected length: $expected_len");
-        if (!is(scalar @warnings, 0,
-            "$testname: utf8n_to_uvchr_error(), disallowed: no warnings 
'utf8':"
-          . " no warnings generated"))
-        {
-            output_warnings(@warnings);
-        }
-        is($ret_ref->[2], $expected_error_flags,
-           "$testname: utf8n_to_uvchr_error(), disallowed: Returns"
-         . " expected error");
-    }
-
-    # Test with CHECK_ONLY
-    undef @warnings;
-    $ret_ref = test_utf8n_to_uvchr_error($bytes, $length, $::UTF8_CHECK_ONLY);
-    is($ret_ref->[0], 0, "$testname: CHECK_ONLY: Returns 0");
-    is($ret_ref->[1], -1, "$testname: CHECK_ONLY: returns -1 for length");
-    if (! is(scalar @warnings, 0,
-                               "$testname: CHECK_ONLY: no warnings generated"))
-    {
-        output_warnings(@warnings);
-    }
-    is($ret_ref->[2], $expected_error_flags,
-       "$testname: utf8n_to_uvchr_error(), disallowed: Returns expected"
-     . " error");
-
-    next if $allow_flags == 0;    # Skip if can't allow this malformation
-
-    # Test when the malformation is allowed
-    undef @warnings;
-    $ret_ref = test_utf8n_to_uvchr_error($bytes, $length, $allow_flags);
-    is($ret_ref->[0], $allowed_uv,
-       "$testname: utf8n_to_uvchr_error(), allowed: Returns expected uv: "
-     . sprintf("0x%04X", $allowed_uv));
-    is($ret_ref->[1], $expected_len,
-       "$testname: utf8n_to_uvchr_error(), allowed: Returns expected length:"
-     . " $expected_len");
-    if (!is(scalar @warnings, 0,
-            "$testname: utf8n_to_uvchr_error(), allowed: no warnings"
-          . " generated"))
-    {
-        output_warnings(@warnings);
-    }
-    is($ret_ref->[2], $expected_error_flags,
-       "$testname: utf8n_to_uvchr_error(), disallowed: Returns"
-     . " expected error");
-}
-
-done_testing;
diff --git a/ext/XS-APItest/t/utf8_setup.pl b/ext/XS-APItest/t/utf8_setup.pl
index 73275753c0..ec7a5ce3d1 100644
--- a/ext/XS-APItest/t/utf8_setup.pl
+++ b/ext/XS-APItest/t/utf8_setup.pl
@@ -8,16 +8,24 @@
 
 sub isASCII { ord "A" == 65 }
 
-sub display_bytes {
+sub display_bytes_no_quotes {
     use bytes;
     my $string = shift;
-    return   '"'
-           . join("", map { sprintf("\\x%02x", ord $_) } split "", $string)
-           . '"';
+    return join("", map { sprintf("\\x%02x", ord $_) } split "", $string)
+}
+
+sub display_bytes {
+    return   '"' . display_bytes_no_quotes(shift) . '"';
 }
 
 sub output_warnings(@) {
-    diag "The warnings were:\n" . join("", @_);
+    my @list = @_;
+    if (@list) {
+        diag "The warnings were:\n" . join "\n", map { chomp; $_ } @list;
+    }
+    else {
+        diag "No warnings were raised";
+    }
 }
 
 sub start_byte_to_cont($) {
@@ -43,7 +51,7 @@ sub start_byte_to_cont($) {
 
 $::is64bit = length sprintf("%x", ~0) > 8;
 
-$::first_continuation = (isASCII) ? 0x80 : 0xA0;
+$::lowest_continuation = (isASCII) ? 0x80 : 0xA0;
 
 $::I8c = (isASCII) ? "\x80" : "\xa0";    # A continuation byte
 
@@ -74,9 +82,9 @@ $::UTF8_WARN_NONCHAR           = 0x0800;
 $::UTF8_DISALLOW_SUPER         = 0x1000;
 $::UTF8_GOT_SUPER              = $UTF8_DISALLOW_SUPER;
 $::UTF8_WARN_SUPER             = 0x2000;
-$::UTF8_DISALLOW_ABOVE_31_BIT  = 0x4000;
-$::UTF8_GOT_ABOVE_31_BIT       = $UTF8_DISALLOW_ABOVE_31_BIT;
-$::UTF8_WARN_ABOVE_31_BIT      = 0x8000;
+$::UTF8_DISALLOW_PERL_EXTENDED  = 0x4000;
+$::UTF8_GOT_PERL_EXTENDED       = $UTF8_DISALLOW_PERL_EXTENDED;
+$::UTF8_WARN_PERL_EXTENDED      = 0x8000;
 $::UTF8_CHECK_ONLY             = 0x10000;
 $::UTF8_NO_CONFIDENCE_IN_CURLEN_ = 0x20000;
 
@@ -93,8 +101,8 @@ $::UTF8_WARN_ILLEGAL_INTERCHANGE
 $::UNICODE_WARN_SURROGATE        = 0x0001;
 $::UNICODE_WARN_NONCHAR          = 0x0002;
 $::UNICODE_WARN_SUPER            = 0x0004;
-$::UNICODE_WARN_ABOVE_31_BIT     = 0x0008;
+$::UNICODE_WARN_PERL_EXTENDED     = 0x0008;
 $::UNICODE_DISALLOW_SURROGATE    = 0x0010;
 $::UNICODE_DISALLOW_NONCHAR      = 0x0020;
 $::UNICODE_DISALLOW_SUPER        = 0x0040;
-$::UNICODE_DISALLOW_ABOVE_31_BIT = 0x0080;
+$::UNICODE_DISALLOW_PERL_EXTENDED = 0x0080;
diff --git a/ext/XS-APItest/t/utf8_warn_base.pl 
b/ext/XS-APItest/t/utf8_warn_base.pl
index a3f4052b55..3eddeaad9c 100644
--- a/ext/XS-APItest/t/utf8_warn_base.pl
+++ b/ext/XS-APItest/t/utf8_warn_base.pl
@@ -1,6 +1,11 @@
 #!perl -w
 
 # This is a base file to be used by various .t's in its directory
+# It tests various malformed UTF-8 sequences and some code points that are
+# "problematic", and verifies that the correct warnings/flags etc are
+# generated when using them.  For the code points, it also takes the UTF-8 and
+# perturbs it to be malformed in various ways, and tests that this gets
+# appropriately detected.
 
 use strict;
 use Test::More;
@@ -13,518 +18,683 @@ BEGIN {
 
 $|=1;
 
-no warnings 'deprecated'; # Some of the below are above IV_MAX on 32 bit
-                          # machines, and that is tested elsewhere
-
 use XS::APItest;
-use Data::Dumper;
 
-my @warnings;
+my @warnings_gotten;
 
 use warnings 'utf8';
-local $SIG{__WARN__} = sub { push @warnings, @_ };
+local $SIG{__WARN__} = sub { my @copy = @_;
+                             push @warnings_gotten, map { chomp; $_ } @copy;
+                           };
+
+my $highest_non_extended_utf8_cp = (isASCII) ? 0x7FFFFFFF : 0x3FFFFFFF;
+my $native_lowest_continuation_chr = I8_to_native(chr $::lowest_continuation);
 
-sub nonportable_regex ($) {
+# C5 is chosen as it is valid for both ASCII and EBCDIC platforms
+my $known_start_byte = I8_to_native("\xC5");
 
-    # Returns a pattern that matches the non-portable message raised either
-    # for the specific input code point, or the one generated when there
-    # is some malformation that precludes the message containing the specific
-    # code point
+sub requires_extended_utf8($) {
 
-    my $code_point = shift;
+    # Returns a boolean as to whether or not the code point parameter fits
+    # into 31 bits, subject to the convention that a negative code point
+    # stands for one that overflows the word size, so won't fit in 31 bits.
 
-    my $string = sprintf '(Code point 0x%X is not Unicode, and'
-                       . '|Any UTF-8 sequence that starts with'
-                       . ' "(\\\x[[:xdigit:]]{2})+" is for a'
-                       . ' non-Unicode code point, and is) not portable',
-                    $code_point;
-    return qr/$string/;
+    return shift > $highest_non_extended_utf8_cp;
 }
 
-# Now test the cases where a legal code point is generated, but may or may not
-# be allowed/warned on.
-my @tests = (
-     # ($testname, $bytes, $warn_flags, $disallow_flags, $expected_error_flags,
-     #  $category, $allowed_uv, $expected_len, $needed_to_discern_len, 
$message )
-    [ "lowest surrogate",
-        (isASCII) ? "\xed\xa0\x80" : I8_to_native("\xf1\xb6\xa0\xa0"),
-        $::UTF8_WARN_SURROGATE, $::UTF8_DISALLOW_SURROGATE, 
$::UTF8_GOT_SURROGATE,
-        'surrogate', 0xD800,
-        (isASCII) ? 3 : 4,
-        2,
-        qr/surrogate/
-    ],
-    [ "a middle surrogate",
-        (isASCII) ? "\xed\xa4\x8d" : I8_to_native("\xf1\xb6\xa8\xad"),
-        $::UTF8_WARN_SURROGATE, $::UTF8_DISALLOW_SURROGATE, 
$::UTF8_GOT_SURROGATE,
-        'surrogate', 0xD90D,
-        (isASCII) ? 3 : 4,
-        2,
-        qr/surrogate/
-    ],
-    [ "highest surrogate",
-        (isASCII) ? "\xed\xbf\xbf" : I8_to_native("\xf1\xb7\xbf\xbf"),
-        $::UTF8_WARN_SURROGATE, $::UTF8_DISALLOW_SURROGATE, 
$::UTF8_GOT_SURROGATE,
-        'surrogate', 0xDFFF,
-        (isASCII) ? 3 : 4,
-        2,
-        qr/surrogate/
-    ],
-    [ "first non_unicode",
-        (isASCII) ? "\xf4\x90\x80\x80" : I8_to_native("\xf9\xa2\xa0\xa0\xa0"),
-        $::UTF8_WARN_SUPER, $::UTF8_DISALLOW_SUPER, $::UTF8_GOT_SUPER,
-        'non_unicode', 0x110000,
-        (isASCII) ? 4 : 5,
-        2,
-        qr/(not Unicode|for a non-Unicode code point).* may not be portable/
-    ],
-    [ "non_unicode whose first byte tells that",
-        (isASCII) ? "\xf5\x80\x80\x80" : I8_to_native("\xfa\xa0\xa0\xa0\xa0"),
-        $::UTF8_WARN_SUPER, $::UTF8_DISALLOW_SUPER, $::UTF8_GOT_SUPER,
-        'non_unicode',
-        (isASCII) ? 0x140000 : 0x200000,
-        (isASCII) ? 4 : 5,
-        1,
-        qr/(not Unicode|for a non-Unicode code point).* may not be portable/
-    ],
-    [ "first of 32 consecutive non-character code points",
-        (isASCII) ? "\xef\xb7\x90" : I8_to_native("\xf1\xbf\xae\xb0"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xFDD0,
-        (isASCII) ? 3 : 4,
-        (isASCII) ? 3 : 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "a mid non-character code point of the 32 consecutive ones",
-        (isASCII) ? "\xef\xb7\xa0" : I8_to_native("\xf1\xbf\xaf\xa0"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xFDE0,
-        (isASCII) ? 3 : 4,
-        (isASCII) ? 3 : 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "final of 32 consecutive non-character code points",
-        (isASCII) ? "\xef\xb7\xaf" : I8_to_native("\xf1\xbf\xaf\xaf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xFDEF,
-        (isASCII) ? 3 : 4,
-        (isASCII) ? 3 : 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+FFFE",
-        (isASCII) ? "\xef\xbf\xbe" : I8_to_native("\xf1\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xFFFE,
-        (isASCII) ? 3 : 4,
-        (isASCII) ? 3 : 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+FFFF",
-        (isASCII) ? "\xef\xbf\xbf" : I8_to_native("\xf1\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xFFFF,
-        (isASCII) ? 3 : 4,
-        (isASCII) ? 3 : 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+1FFFE",
-        (isASCII) ? "\xf0\x9f\xbf\xbe" : I8_to_native("\xf3\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x1FFFE,
-        4, 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+1FFFF",
-        (isASCII) ? "\xf0\x9f\xbf\xbf" : I8_to_native("\xf3\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x1FFFF,
-        4, 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+2FFFE",
-        (isASCII) ? "\xf0\xaf\xbf\xbe" : I8_to_native("\xf5\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x2FFFE,
-        4, 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+2FFFF",
-        (isASCII) ? "\xf0\xaf\xbf\xbf" : I8_to_native("\xf5\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x2FFFF,
-        4, 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+3FFFE",
-        (isASCII) ? "\xf0\xbf\xbf\xbe" : I8_to_native("\xf7\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x3FFFE,
-        4, 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+3FFFF",
-        (isASCII) ? "\xf0\xbf\xbf\xbf" : I8_to_native("\xf7\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x3FFFF,
-        4, 4,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+4FFFE",
-        (isASCII) ? "\xf1\x8f\xbf\xbe" : I8_to_native("\xf8\xa9\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x4FFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+4FFFF",
-        (isASCII) ? "\xf1\x8f\xbf\xbf" : I8_to_native("\xf8\xa9\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x4FFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+5FFFE",
-        (isASCII) ? "\xf1\x9f\xbf\xbe" : I8_to_native("\xf8\xab\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x5FFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+5FFFF",
-        (isASCII) ? "\xf1\x9f\xbf\xbf" : I8_to_native("\xf8\xab\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x5FFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+6FFFE",
-        (isASCII) ? "\xf1\xaf\xbf\xbe" : I8_to_native("\xf8\xad\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x6FFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+6FFFF",
-        (isASCII) ? "\xf1\xaf\xbf\xbf" : I8_to_native("\xf8\xad\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x6FFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+7FFFE",
-        (isASCII) ? "\xf1\xbf\xbf\xbe" : I8_to_native("\xf8\xaf\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x7FFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+7FFFF",
-        (isASCII) ? "\xf1\xbf\xbf\xbf" : I8_to_native("\xf8\xaf\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x7FFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+8FFFE",
-        (isASCII) ? "\xf2\x8f\xbf\xbe" : I8_to_native("\xf8\xb1\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x8FFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+8FFFF",
-        (isASCII) ? "\xf2\x8f\xbf\xbf" : I8_to_native("\xf8\xb1\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x8FFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+9FFFE",
-        (isASCII) ? "\xf2\x9f\xbf\xbe" : I8_to_native("\xf8\xb3\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x9FFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+9FFFF",
-        (isASCII) ? "\xf2\x9f\xbf\xbf" : I8_to_native("\xf8\xb3\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x9FFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+AFFFE",
-        (isASCII) ? "\xf2\xaf\xbf\xbe" : I8_to_native("\xf8\xb5\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xAFFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+AFFFF",
-        (isASCII) ? "\xf2\xaf\xbf\xbf" : I8_to_native("\xf8\xb5\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xAFFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+BFFFE",
-        (isASCII) ? "\xf2\xbf\xbf\xbe" : I8_to_native("\xf8\xb7\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xBFFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+BFFFF",
-        (isASCII) ? "\xf2\xbf\xbf\xbf" : I8_to_native("\xf8\xb7\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xBFFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+CFFFE",
-        (isASCII) ? "\xf3\x8f\xbf\xbe" : I8_to_native("\xf8\xb9\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xCFFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+CFFFF",
-        (isASCII) ? "\xf3\x8f\xbf\xbf" : I8_to_native("\xf8\xb9\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xCFFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+DFFFE",
-        (isASCII) ? "\xf3\x9f\xbf\xbe" : I8_to_native("\xf8\xbb\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xDFFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+DFFFF",
-        (isASCII) ? "\xf3\x9f\xbf\xbf" : I8_to_native("\xf8\xbb\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xDFFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+EFFFE",
-        (isASCII) ? "\xf3\xaf\xbf\xbe" : I8_to_native("\xf8\xbd\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xEFFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+EFFFF",
-        (isASCII) ? "\xf3\xaf\xbf\xbf" : I8_to_native("\xf8\xbd\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xEFFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+FFFFE",
-        (isASCII) ? "\xf3\xbf\xbf\xbe" : I8_to_native("\xf8\xbf\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xFFFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+FFFFF",
-        (isASCII) ? "\xf3\xbf\xbf\xbf" : I8_to_native("\xf8\xbf\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0xFFFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+10FFFE",
-        (isASCII) ? "\xf4\x8f\xbf\xbe" : I8_to_native("\xf9\xa1\xbf\xbf\xbe"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x10FFFE,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-    [ "non-character code point U+10FFFF",
-        (isASCII) ? "\xf4\x8f\xbf\xbf" : I8_to_native("\xf9\xa1\xbf\xbf\xbf"),
-        $::UTF8_WARN_NONCHAR, $::UTF8_DISALLOW_NONCHAR, $::UTF8_GOT_NONCHAR,
-        'nonchar', 0x10FFFF,
-        (isASCII) ? 4 : 5,
-        (isASCII) ? 4 : 5,
-        qr/Unicode non-character.*is not recommended for open interchange/
-    ],
-);
-
-if (! $::is64bit) {
+sub is_extended_utf8($) {
+
+    # Returns a boolean as to whether or not the input UTF-8 sequence uses
+    # Perl extended UTF-8.
+
+    my $byte = substr(shift, 0, 1);
+    return ord $byte >= 0xFE if isASCII;
+    return $byte == I8_to_native("\xFF");
+}
+
+sub overflow_discern_len($) {
+
+    # Returns how many bytes are needed to tell if a non-overlong UTF-8
+    # sequence is for a code point that won't fit in the platform's word size.
+    # Only the length of the sequence representing a single code point is
+    # needed.
+
     if (isASCII) {
-        no warnings qw{portable overflow};
-        push @tests,
-            [ "Lowest 33 bit code point: overflow",
-                "\xFE\x84\x80\x80\x80\x80\x80",
-                $::UTF8_WARN_ABOVE_31_BIT, $::UTF8_DISALLOW_ABOVE_31_BIT,
-                $::UTF8_GOT_ABOVE_31_BIT,
-                'utf8', 0x100000000,
-                7, 1,
-                qr/and( is)? not portable/
-            ];
+        return ($::is64bit) ? 3 : 1;
+
+        # Below is needed for code points above IV_MAX
+        #return ($::is64bit) ? 3 : ((shift == $::max_bytes)
+        #                           ? 1
+        #                           : 2);
     }
+
+    return ($::is64bit) ? 2 : 8;
 }
-else {
-    no warnings qw{portable overflow};
-    push @tests,
-        [ "More than 32 bits",
+
+sub overlong_discern_len($) {
+
+    # Returns how many bytes are needed to tell if the input UTF-8 sequence
+    # for a code point is overlong
+
+    my $string = shift;
+    my $length = length $string;
+    my $byte = ord native_to_I8(substr($string, 0, 1));
+    if (isASCII) {
+        return ($byte >= 0xFE)
+                ? ((! $::is64bit)
+                    ? 1
+                    : ($byte == 0xFF) ? 7 : 2)
+                : (($length == 2) ? 1 : 2);
+        # Below is needed for code points above IV_MAX
+        #return ($length == $::max_bytes)
+        #          # This is constrained to 1 on 32-bit machines, as it
+        #          # overflows there
+        #        ? (($::is64bit) ? 7 : 1)
+        #        : (($length == 2) ? 1 : 2);
+    }
+
+    return ($length == $::max_bytes) ? 8 : (($length <= 3) ? 1 : 2);
+}
+
+my @tests;
+{
+    no warnings qw(portable overflow);
+    @tests = (
+        # $testname,
+        # $bytes,                  UTF-8 string
+        # $allowed_uv,             code point $bytes evaluates to; -1 if
+        #                          overflows
+        # $needed_to_discern_len   optional, how long an initial substring do
+        #                          we need to tell that the string must be for
+        #                          a code point in the category it falls in,
+        #                          like being a surrogate; 0 indicates we need
+        #                          the whole string.  Some categories have a
+        #                          default that is used if this is omitted.
+        [ "orphan continuation byte malformation",
+            I8_to_native("$::I8c"),
+            0xFFFD,
+            1,
+        ],
+        [ "overlong malformation, lowest 2-byte",
+            (isASCII) ? "\xc0\x80" : I8_to_native("\xc0\xa0"),
+            0,   # NUL
+        ],
+        [ "overlong malformation, highest 2-byte",
+            (isASCII) ? "\xc1\xbf" : I8_to_native("\xc4\xbf"),
+            (isASCII) ? 0x7F : utf8::unicode_to_native(0x9F),
+        ],
+        [ "overlong malformation, lowest 3-byte",
+            (isASCII) ? "\xe0\x80\x80" : I8_to_native("\xe0\xa0\xa0"),
+            0,   # NUL
+        ],
+        [ "overlong malformation, highest 3-byte",
+            (isASCII) ? "\xe0\x9f\xbf" : I8_to_native("\xe0\xbf\xbf"),
+            (isASCII) ? 0x7FF : 0x3FF,
+        ],
+        [ "lowest surrogate",
+            (isASCII) ? "\xed\xa0\x80" : I8_to_native("\xf1\xb6\xa0\xa0"),
+            0xD800,
+        ],
+        [ "a middle surrogate",
+            (isASCII) ? "\xed\xa4\x8d" : I8_to_native("\xf1\xb6\xa8\xad"),
+            0xD90D,
+        ],
+        [ "highest surrogate",
+            (isASCII) ? "\xed\xbf\xbf" : I8_to_native("\xf1\xb7\xbf\xbf"),
+            0xDFFF,
+        ],
+        [ "first of 32 consecutive non-character code points",
+            (isASCII) ? "\xef\xb7\x90" : I8_to_native("\xf1\xbf\xae\xb0"),
+            0xFDD0,
+        ],
+        [ "a mid non-character code point of the 32 consecutive ones",
+            (isASCII) ? "\xef\xb7\xa0" : I8_to_native("\xf1\xbf\xaf\xa0"),
+            0xFDE0,
+        ],
+        [ "final of 32 consecutive non-character code points",
+            (isASCII) ? "\xef\xb7\xaf" : I8_to_native("\xf1\xbf\xaf\xaf"),
+            0xFDEF,
+        ],
+        [ "non-character code point U+FFFE",
+            (isASCII) ? "\xef\xbf\xbe" : I8_to_native("\xf1\xbf\xbf\xbe"),
+            0xFFFE,
+        ],
+        [ "non-character code point U+FFFF",
+            (isASCII) ? "\xef\xbf\xbf" : I8_to_native("\xf1\xbf\xbf\xbf"),
+            0xFFFF,
+        ],
+        [ "overlong malformation, lowest 4-byte",
+            (isASCII) ? "\xf0\x80\x80\x80" : I8_to_native("\xf0\xa0\xa0\xa0"),
+            0,   # NUL
+        ],
+        [ "overlong malformation, highest 4-byte",
+            (isASCII) ? "\xf0\x8F\xbf\xbf" : I8_to_native("\xf0\xaf\xbf\xbf"),
+            (isASCII) ? 0xFFFF : 0x3FFF,
+        ],
+        [ "non-character code point U+1FFFE",
+            (isASCII) ? "\xf0\x9f\xbf\xbe" : I8_to_native("\xf3\xbf\xbf\xbe"),
+            0x1FFFE,
+        ],
+        [ "non-character code point U+1FFFF",
+            (isASCII) ? "\xf0\x9f\xbf\xbf" : I8_to_native("\xf3\xbf\xbf\xbf"),
+            0x1FFFF,
+        ],
+        [ "non-character code point U+2FFFE",
+            (isASCII) ? "\xf0\xaf\xbf\xbe" : I8_to_native("\xf5\xbf\xbf\xbe"),
+            0x2FFFE,
+        ],
+        [ "non-character code point U+2FFFF",
+            (isASCII) ? "\xf0\xaf\xbf\xbf" : I8_to_native("\xf5\xbf\xbf\xbf"),
+            0x2FFFF,
+        ],
+        [ "non-character code point U+3FFFE",
+            (isASCII) ? "\xf0\xbf\xbf\xbe" : I8_to_native("\xf7\xbf\xbf\xbe"),
+            0x3FFFE,
+        ],
+        [ "non-character code point U+3FFFF",
+            (isASCII) ? "\xf0\xbf\xbf\xbf" : I8_to_native("\xf7\xbf\xbf\xbf"),
+            0x3FFFF,
+        ],
+        [ "non-character code point U+4FFFE",
+            (isASCII)
+            ?               "\xf1\x8f\xbf\xbe"
+            : I8_to_native("\xf8\xa9\xbf\xbf\xbe"),
+            0x4FFFE,
+        ],
+        [ "non-character code point U+4FFFF",
+            (isASCII)
+            ?               "\xf1\x8f\xbf\xbf"
+            : I8_to_native("\xf8\xa9\xbf\xbf\xbf"),
+            0x4FFFF,
+        ],
+        [ "non-character code point U+5FFFE",
+            (isASCII)
+            ?              "\xf1\x9f\xbf\xbe"
+            : I8_to_native("\xf8\xab\xbf\xbf\xbe"),
+            0x5FFFE,
+        ],
+        [ "non-character code point U+5FFFF",
+            (isASCII)
+            ?              "\xf1\x9f\xbf\xbf"
+            : I8_to_native("\xf8\xab\xbf\xbf\xbf"),
+            0x5FFFF,
+        ],
+        [ "non-character code point U+6FFFE",
+            (isASCII)
+            ?              "\xf1\xaf\xbf\xbe"
+            : I8_to_native("\xf8\xad\xbf\xbf\xbe"),
+            0x6FFFE,
+        ],
+        [ "non-character code point U+6FFFF",
+            (isASCII)
+            ?              "\xf1\xaf\xbf\xbf"
+            : I8_to_native("\xf8\xad\xbf\xbf\xbf"),
+            0x6FFFF,
+        ],
+        [ "non-character code point U+7FFFE",
+            (isASCII)
+            ?              "\xf1\xbf\xbf\xbe"
+            : I8_to_native("\xf8\xaf\xbf\xbf\xbe"),
+            0x7FFFE,
+        ],
+        [ "non-character code point U+7FFFF",
+            (isASCII)
+            ?              "\xf1\xbf\xbf\xbf"
+            : I8_to_native("\xf8\xaf\xbf\xbf\xbf"),
+            0x7FFFF,
+        ],
+        [ "non-character code point U+8FFFE",
+            (isASCII)
+            ?              "\xf2\x8f\xbf\xbe"
+            : I8_to_native("\xf8\xb1\xbf\xbf\xbe"),
+            0x8FFFE,
+        ],
+        [ "non-character code point U+8FFFF",
+            (isASCII)
+            ?              "\xf2\x8f\xbf\xbf"
+            : I8_to_native("\xf8\xb1\xbf\xbf\xbf"),
+            0x8FFFF,
+        ],
+        [ "non-character code point U+9FFFE",
+            (isASCII)
+            ?              "\xf2\x9f\xbf\xbe"
+            : I8_to_native("\xf8\xb3\xbf\xbf\xbe"),
+            0x9FFFE,
+        ],
+        [ "non-character code point U+9FFFF",
+            (isASCII)
+            ?              "\xf2\x9f\xbf\xbf"
+            : I8_to_native("\xf8\xb3\xbf\xbf\xbf"),
+            0x9FFFF,
+        ],
+        [ "non-character code point U+AFFFE",
+            (isASCII)
+            ?              "\xf2\xaf\xbf\xbe"
+            : I8_to_native("\xf8\xb5\xbf\xbf\xbe"),
+            0xAFFFE,
+        ],
+        [ "non-character code point U+AFFFF",
+            (isASCII)
+            ?              "\xf2\xaf\xbf\xbf"
+            : I8_to_native("\xf8\xb5\xbf\xbf\xbf"),
+            0xAFFFF,
+        ],
+        [ "non-character code point U+BFFFE",
+            (isASCII)
+            ?              "\xf2\xbf\xbf\xbe"
+            : I8_to_native("\xf8\xb7\xbf\xbf\xbe"),
+            0xBFFFE,
+        ],
+        [ "non-character code point U+BFFFF",
+            (isASCII)
+            ?              "\xf2\xbf\xbf\xbf"
+            : I8_to_native("\xf8\xb7\xbf\xbf\xbf"),
+            0xBFFFF,
+        ],
+        [ "non-character code point U+CFFFE",
+            (isASCII)
+            ?              "\xf3\x8f\xbf\xbe"
+            : I8_to_native("\xf8\xb9\xbf\xbf\xbe"),
+            0xCFFFE,
+        ],
+        [ "non-character code point U+CFFFF",
+            (isASCII)
+            ?              "\xf3\x8f\xbf\xbf"
+            : I8_to_native("\xf8\xb9\xbf\xbf\xbf"),
+            0xCFFFF,
+        ],
+        [ "non-character code point U+DFFFE",
+            (isASCII)
+            ?              "\xf3\x9f\xbf\xbe"
+            : I8_to_native("\xf8\xbb\xbf\xbf\xbe"),
+            0xDFFFE,
+        ],
+        [ "non-character code point U+DFFFF",
+            (isASCII)
+            ?              "\xf3\x9f\xbf\xbf"
+            : I8_to_native("\xf8\xbb\xbf\xbf\xbf"),
+            0xDFFFF,
+        ],
+        [ "non-character code point U+EFFFE",
+            (isASCII)
+            ?              "\xf3\xaf\xbf\xbe"
+            : I8_to_native("\xf8\xbd\xbf\xbf\xbe"),
+            0xEFFFE,
+        ],
+        [ "non-character code point U+EFFFF",
+            (isASCII)
+            ?              "\xf3\xaf\xbf\xbf"
+            : I8_to_native("\xf8\xbd\xbf\xbf\xbf"),
+            0xEFFFF,
+        ],
+        [ "non-character code point U+FFFFE",
+            (isASCII)
+            ?              "\xf3\xbf\xbf\xbe"
+            : I8_to_native("\xf8\xbf\xbf\xbf\xbe"),
+            0xFFFFE,
+        ],
+        [ "non-character code point U+FFFFF",
+            (isASCII)
+            ?              "\xf3\xbf\xbf\xbf"
+            : I8_to_native("\xf8\xbf\xbf\xbf\xbf"),
+            0xFFFFF,
+        ],
+        [ "non-character code point U+10FFFE",
+            (isASCII)
+            ?              "\xf4\x8f\xbf\xbe"
+            : I8_to_native("\xf9\xa1\xbf\xbf\xbe"),
+            0x10FFFE,
+        ],
+        [ "non-character code point U+10FFFF",
+            (isASCII)
+            ?              "\xf4\x8f\xbf\xbf"
+            : I8_to_native("\xf9\xa1\xbf\xbf\xbf"),
+            0x10FFFF,
+        ],
+        [ "first non_unicode",
+            (isASCII)
+            ?              "\xf4\x90\x80\x80"
+            : I8_to_native("\xf9\xa2\xa0\xa0\xa0"),
+            0x110000,
+            2,
+        ],
+        [ "non_unicode whose first byte tells that",
+            (isASCII)
+            ?              "\xf5\x80\x80\x80"
+            : I8_to_native("\xfa\xa0\xa0\xa0\xa0"),
+            (isASCII) ? 0x140000 : 0x200000,
+            1,
+        ],
+        [ "overlong malformation, lowest 5-byte",
             (isASCII)
-            ?       "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80"
+            ?              "\xf8\x80\x80\x80\x80"
+            : I8_to_native("\xf8\xa0\xa0\xa0\xa0"),
+            0,   # NUL
+        ],
+        [ "overlong malformation, highest 5-byte",
+            (isASCII)
+            ?              "\xf8\x87\xbf\xbf\xbf"
+            : I8_to_native("\xf8\xa7\xbf\xbf\xbf"),
+            (isASCII) ? 0x1FFFFF : 0x3FFFF,
+        ],
+        [ "overlong malformation, lowest 6-byte",
+            (isASCII)
+            ?              "\xfc\x80\x80\x80\x80\x80"
+            : I8_to_native("\xfc\xa0\xa0\xa0\xa0\xa0"),
+            0,   # NUL
+        ],
+        [ "overlong malformation, highest 6-byte",
+            (isASCII)
+            ?              "\xfc\x83\xbf\xbf\xbf\xbf"
+            : I8_to_native("\xfc\xa3\xbf\xbf\xbf\xbf"),
+            (isASCII) ? 0x3FFFFFF : 0x3FFFFF,
+        ],
+        [ "overlong malformation, lowest 7-byte",
+            (isASCII)
+            ?              "\xfe\x80\x80\x80\x80\x80\x80"
+            : I8_to_native("\xfe\xa0\xa0\xa0\xa0\xa0\xa0"),
+            0,   # NUL
+        ],
+        [ "overlong malformation, highest 7-byte",
+            (isASCII)
+            ?              "\xfe\x81\xbf\xbf\xbf\xbf\xbf"
+            : I8_to_native("\xfe\xa1\xbf\xbf\xbf\xbf\xbf"),
+            (isASCII) ? 0x7FFFFFFF : 0x3FFFFFF,
+        ],
+        [ "highest 31 bit code point",
+            (isASCII)
+            ?  "\xfd\xbf\xbf\xbf\xbf\xbf"
             : I8_to_native(
-                    
"\xff\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-            $::UTF8_WARN_ABOVE_31_BIT, $::UTF8_DISALLOW_ABOVE_31_BIT,
-            $::UTF8_GOT_ABOVE_31_BIT,
-            'utf8', 0x1000000000,
-            $::max_bytes, (isASCII) ? 1 : 7,
-            qr/and( is)? not portable/
-        ];
-        [ "requires at least 32 bits",
+               "\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa1\xbf\xbf\xbf\xbf\xbf\xbf"),
+            0x7FFFFFFF,
+            1,
+        ],
+        [ "lowest 32 bit code point",
             (isASCII)
-             ?  "\xfe\x82\x80\x80\x80\x80\x80"
-             : I8_to_native(
+            ?  "\xfe\x82\x80\x80\x80\x80\x80"
+            : I8_to_native(
                 "\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0"),
-            # This code point is chosen so that it is representable in a UV on
-            # 32-bit machines
-            $::UTF8_WARN_ABOVE_31_BIT, $::UTF8_DISALLOW_ABOVE_31_BIT,
-            $::UTF8_GOT_ABOVE_31_BIT,
-            'utf8', 0x80000000,
-            (isASCII) ? 7 : $::max_bytes,
-            (isASCII) ? 1 : 8,
-            nonportable_regex(0x80000000)
+            ($::is64bit) ? 0x80000000 : -1,   # Overflows on 32-bit systems
+            1,
         ],
-        [ "highest 32 bit code point",
+        # Used when UV_MAX is allowed as a code point
+        #[ "highest 32 bit code point",
+        #    (isASCII)
+        #    ?  "\xfe\x83\xbf\xbf\xbf\xbf\xbf"
+        #    : I8_to_native(
+        #       "\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa3\xbf\xbf\xbf\xbf\xbf\xbf"),
+        #    0xFFFFFFFF,
+        #],
+        #[ "Lowest 33 bit code point",
+        #    (isASCII)
+        #    ?  "\xfe\x84\x80\x80\x80\x80\x80"
+        #    : I8_to_native(
+        #        "\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa4\xa0\xa0\xa0\xa0\xa0\xa0"),
+        #    ($::is64bit) ? 0x100000000 : 0x0,   # Overflows on 32-bit systems
+        #],
+    );
+
+    if (! $::is64bit) {
+        if (isASCII) {
+            push @tests,
+                [ "overlong malformation, but naively looks like overflow",
+                    "\xff\x80\x80\x80\x80\x80\x80\x81\xbf\xbf\xbf\xbf\xbf",
+                    0x7FFFFFFF,
+                ],
+                # Used when above IV_MAX are allowed.
+                #[ "overlong malformation, but naively looks like overflow",
+                #    "\xff\x80\x80\x80\x80\x80\x80\x83\xbf\xbf\xbf\xbf\xbf",
+                #    0xFFFFFFFF,
+                #],
+                [ "overflow that old algorithm failed to detect",
+                    "\xfe\x86\x80\x80\x80\x80\x80",
+                    -1,
+                ];
+        }
+    }
+
+    push @tests,
+        [ "overlong malformation, lowest max-byte",
             (isASCII)
-             ?  "\xfe\x83\xbf\xbf\xbf\xbf\xbf"
+             ?      "\xff\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80"
              : I8_to_native(
-                "\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa3\xbf\xbf\xbf\xbf\xbf\xbf"),
-            $::UTF8_WARN_ABOVE_31_BIT, $::UTF8_DISALLOW_ABOVE_31_BIT,
-            $::UTF8_GOT_ABOVE_31_BIT,
-            'utf8', 0xFFFFFFFF,
-            (isASCII) ? 7 : $::max_bytes,
-            (isASCII) ? 1 : 8,
-            nonportable_regex(0xffffffff)
+                    
"\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+            0,   # NUL
         ],
-        [ "requires at least 32 bits, and use SUPER-type flags, instead of"
-        . " ABOVE_31_BIT",
-            (isASCII)
-             ? "\xfe\x82\x80\x80\x80\x80\x80"
+        [ "overlong malformation, highest max-byte",
+            (isASCII)    # 2**36-1 on ASCII; 2**30-1 on EBCDIC
+             ?      "\xff\x80\x80\x80\x80\x80\x80\xbf\xbf\xbf\xbf\xbf\xbf"
              : I8_to_native(
-               "\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0"),
-            $::UTF8_WARN_SUPER, $::UTF8_DISALLOW_SUPER, $::UTF8_GOT_SUPER,
-            'utf8', 0x80000000,
-            (isASCII) ? 7 : $::max_bytes,
-            1,
-            nonportable_regex(0x80000000)
-        ],
-        [ "overflow with warnings/disallow for more than 31 bits",
-            # This tests the interaction of 
WARN_ABOVE_31_BIT/DISALLOW_ABOVE_31_BIT
-            # with overflow.  The overflow malformation is never allowed, so
-            # preventing it takes precedence if the ABOVE_31_BIT options would
-            # otherwise allow in an overflowing value.  The ASCII code points 
(1
-            # for 32-bits; 1 for 64) were chosen because the old overflow
-            # detection algorithm did not catch them; this means this test also
-            # checks for that fix.  The EBCDIC are arbitrary overflowing ones
-            # since we have no reports of failures with it.
-            ((isASCII)
-               ?    "\xff\x80\x90\x90\x90\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"
-               : I8_to_native(
-                    
"\xff\xB0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0")),
-            $::UTF8_WARN_ABOVE_31_BIT,
-            $::UTF8_DISALLOW_ABOVE_31_BIT,
-            $::UTF8_GOT_ABOVE_31_BIT,
-            'utf8', 0,
-            (! isASCII || $::is64bit) ? $::max_bytes : 7,
-            (isASCII || $::is64bit) ? 2 : 8,
-            qr/overflows/
+                    
"\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xbf\xbf\xbf\xbf\xbf\xbf"),
+            (isASCII) ? (($::is64bit) ? 0xFFFFFFFFF : -1) : 0x3FFFFFFF,
         ];
 
-    if (! isASCII) {
-        push @tests,   # These could falsely show wrongly in a naive
-                       # implementation
-            [ "requires at least 32 bits",
-                I8_to_native(
-                    
"\xff\xa0\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-                $::UTF8_WARN_ABOVE_31_BIT,$::UTF8_DISALLOW_ABOVE_31_BIT,
-                $::UTF8_GOT_ABOVE_31_BIT,
-                'utf8', 0x800000000,
-                $::max_bytes, 7,
-                nonportable_regex(0x80000000)
+    if (isASCII) {
+        push @tests,
+            [ "Lowest code point requiring 13 bytes to represent", # 2**36
+                "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80",
+                ($::is64bit) ? 0x1000000000 : -1,    # overflows on 32bit
             ],
-            [ "requires at least 32 bits",
-                I8_to_native(
-                    
"\xff\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-                $::UTF8_WARN_ABOVE_31_BIT,$::UTF8_DISALLOW_ABOVE_31_BIT,
-                $::UTF8_GOT_ABOVE_31_BIT,
-                'utf8', 0x10000000000,
-                $::max_bytes, 6,
-                nonportable_regex(0x10000000000)
+    };
+
+    if ($::is64bit) {
+        push @tests,
+            [ "highest 63 bit code point",
+              (isASCII)
+              ? "\xff\x80\x87\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"
+              : I8_to_native(
+                "\xff\xa7\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"),
+              0x7FFFFFFFFFFFFFFF,
+              (isASCII) ? 1 : 2,
             ],
-            [ "requires at least 32 bits",
-                I8_to_native(
+            [ "first 64 bit code point",
+              (isASCII)
+              ? "\xff\x80\x88\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80"
+              : I8_to_native(
+                "\xff\xa8\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+              -1,
+            ];
+            # Used when UV_MAX is allowed as a code point
+            #[ "highest 64 bit code point",
+            #  (isASCII)
+            #  ? "\xff\x80\x8f\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"
+            #  : I8_to_native(
+            #    "\xff\xaf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"),
+            #  0xFFFFFFFFFFFFFFFF,
+            #  (isASCII) ? 1 : 2,
+            #],
+            #[ "first 65 bit code point",
+            #  (isASCII)
+            #  ? "\xff\x80\x9f\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80"
+            #  : I8_to_native(
+            #    "\xff\xb0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+            #  0,
+            #];
+        if (isASCII) {
+            push @tests,
+                [ "overflow that old algorithm failed to detect",
+                    "\xff\x80\x90\x90\x90\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf",
+                    -1,
+                ];
+        }
+        else {
+            push @tests,    # These could falsely show wrongly in a naive
+                            # implementation
+                [ "requires at least 32 bits",
+                    I8_to_native(
+                    
"\xff\xa0\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                    0x800000000,
+                ],
+                [ "requires at least 32 bits",
+                    I8_to_native(
+                    
"\xff\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                    0x10000000000,
+                ],
+                [ "requires at least 32 bits",
+                    I8_to_native(
                     
"\xff\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-                $::UTF8_WARN_ABOVE_31_BIT,$::UTF8_DISALLOW_ABOVE_31_BIT,
-                $::UTF8_GOT_ABOVE_31_BIT,
-                'utf8', 0x200000000000,
-                $::max_bytes, 5,
-                nonportable_regex(0x20000000000)
-            ],
-            [ "requires at least 32 bits",
-                I8_to_native(
+                    0x200000000000,
+                ],
+                [ "requires at least 32 bits",
+                    I8_to_native(
                     
"\xff\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-                $::UTF8_WARN_ABOVE_31_BIT,$::UTF8_DISALLOW_ABOVE_31_BIT,
-                $::UTF8_GOT_ABOVE_31_BIT,
-                'utf8', 0x4000000000000,
-                $::max_bytes, 4,
-                nonportable_regex(0x4000000000000)
-            ],
-            [ "requires at least 32 bits",
-                I8_to_native(
+                    0x4000000000000,
+                ],
+                [ "requires at least 32 bits",
+                    I8_to_native(
                     
"\xff\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-                $::UTF8_WARN_ABOVE_31_BIT,$::UTF8_DISALLOW_ABOVE_31_BIT,
-                $::UTF8_GOT_ABOVE_31_BIT,
-                'utf8', 0x80000000000000,
-                $::max_bytes, 3,
-                nonportable_regex(0x80000000000000)
-            ],
-            [ "requires at least 32 bits",
-                I8_to_native(
+                    0x80000000000000,
+                ],
+                [ "requires at least 32 bits",
+                    I8_to_native(
                     
"\xff\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-                $::UTF8_WARN_ABOVE_31_BIT,$::UTF8_DISALLOW_ABOVE_31_BIT,
-                $::UTF8_GOT_ABOVE_31_BIT,
-                'utf8', 0x1000000000000000,
-                $::max_bytes, 2,
-                nonportable_regex(0x1000000000000000)
-            ];
+                    0x1000000000000000,
+                ];
+        }
+    }
+}
+
+sub flags_to_text($$)
+{
+    my ($flags, $flags_to_text_ref) = @_;
+
+    # Returns a string containing a mnemonic representation of the bits that
+    # are set in the $flags.  These are assumed to be flag bits.  The return
+    # looks like "FOO|BAR|BAZ".  The second parameter is a reference to an
+    # array that gives the textual representation of all the possible flags.
+    # Element 0 is the text for the bit 0 flag; element 1 for bit 1; ....  If
+    # no bits at all are set the string "0" is returned;
+
+    my @flag_text;
+    my $shift = 0;
+
+    return "0" if $flags == 0;
+
+    while ($flags) {
+        #diag sprintf "%x", $flags;
+        if ($flags & 1) {
+            push @flag_text, $flags_to_text_ref->[$shift];
+        }
+        $shift++;
+        $flags >>= 1;
     }
+
+    return join "|", @flag_text;
+}
+
+# Possible flag returns from utf8n_to_uvchr_error().  These should have G_,
+# instead of A_, D_, but the prefixes will be used in a a later commit, so
+# minimize churn by having them here.
+my @utf8n_flags_to_text =  ( qw(
+        A_EMPTY
+        A_CONTINUATION
+        A_NON_CONTINUATION
+        A_SHORT
+        A_LONG
+        A_LONG_AND_ITS_VALUE
+        PLACEHOLDER
+        A_OVERFLOW
+        D_SURROGATE
+        W_SURROGATE
+        D_NONCHAR
+        W_NONCHAR
+        D_SUPER
+        W_SUPER
+        D_PERL_EXTENDED
+        W_PERL_EXTENDED
+        CHECK_ONLY
+        NO_CONFIDENCE_IN_CURLEN_
+    ) );
+
+sub utf8n_display_call($)
+{
+    # Converts an eval string that calls test_utf8n_to_uvchr into a more human
+    # readable form, and returns it.  Doesn't work if the byte string contains
+    # an apostrophe.  The return will look something like:
+    #   test_utf8n_to_uvchr_error('$bytes', $length, $flags)
+    #diag $_[0];
+
+    $_[0] =~ / ^ ( [^(]* \( ) ' ( [^']*? ) ' ( .+ , \D* ) ( \d+ ) \) $ /x;
+    my $text1 = $1;     # Everything before the byte string
+    my $bytes = $2;
+    my $text2 = $3;     # Includes the length
+    my $flags = $4;
+
+    return $text1
+         . display_bytes($bytes)
+         . $text2
+         . flags_to_text($flags, \@utf8n_flags_to_text)
+         . ')';
+}
+
+sub uvchr_display_call($)
+{
+    # Converts an eval string that calls test_uvchr_to_utf8 into a more human
+    # readable form, and returns it.  The return will look something like:
+    #   test_uvchr_to_utf8n_flags($uv, $flags)
+    #diag $_[0];
+
+    my @flags_to_text =  ( qw(
+            W_SURROGATE
+            W_NONCHAR
**** PATCH TRUNCATED AT 2000 LINES -- 3868 NOT SHOWN ****

--
Perl5 Master Repository

[perl.git] branch blead, updated. v5.27.1-221-gb7d798d66e

Reply via email to