In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/977c1d31fff4d41aa42e40c904fe08b509e3a34e?hp=68a91acb87c0e5d700bf9bc52e8374e6ce77f878>

- Log -----------------------------------------------------------------
commit 977c1d31fff4d41aa42e40c904fe08b509e3a34e
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 16:31:18 2012 -0600

    Deprecate utf8_to_uvchr() and utf8_to_uvuni()
    
    These functions can read beyond the end of their input strings if
    presented with malformed UTF-8 input.  Perl core code has been converted
    to use other functions instead of these.

M       embed.fnc
M       pod/perldelta.pod
M       proto.h
M       utf8.c

commit 4b88fb76efce8c436e63b907c9842345d4fa77c7
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 15:38:06 2012 -0600

    Use the new utf8 to code point functions
    
    These functions should be used in preference to the old ones which can
    read beyond the end of the input string.

M       cygwin/cygwin.c
M       dist/Data-Dumper/Dumper.pm
M       dist/Data-Dumper/Dumper.xs
M       dump.c
M       ext/XS-APItest/APItest.pm
M       ext/XS-APItest/APItest.xs
M       handy.h
M       pod/perlguts.pod
M       pod/perlunicode.pod
M       pp.c
M       regcomp.c
M       symbian/PerlBase.cpp
M       t/lib/warnings/utf8
M       toke.c
M       utf8.c

commit 27d6c58a7e12243bef66c58b38e7d1415d9ca07e
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 15:13:19 2012 -0600

    utf8.c: Add valid_utf8_to_uvuni() and valid_utf8_to_uvchr()
    
    These functions are like utf8_to_uvuni() and utf8_to_uvchr(), but their
    name implies that the input UTF-8 has been validated.
    
    They are not currently documented, as it's best for XS writers to call
    the functions that do validation.

M       embed.fnc
M       embed.h
M       proto.h
M       utf8.c

commit ec5f19d09949aac9034bb62ade44ffba8d4d2bb1
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 15:03:01 2012 -0600

    utf8.c: Add utf8_to_uvchr_buf() and utf8_to_uvuni_buf()
    
    The existing functions (utf8_to_uvchr and utf8_to_uvuni) have a
    deficiency in that they could read beyond the end of the input string if
    given malformed input.  This commit creates two new functions which
    behave as the old ones did, but have an extra parameter each, which
    gives the upper limit to the string, so no read beyond it is done.

M       embed.fnc
M       embed.h
M       pod/perldelta.pod
M       proto.h
M       utf8.c

commit d0460f306d2b79d09a9e5694f9f72c50a2481b83
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 14:48:51 2012 -0600

    utf8.c: pod clarification

M       utf8.c

commit a1433954f53591f4446530df211b86112c6c2446
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 13:48:58 2012 -0600

    utf8.c: pod (mostly formatting) + comments changes

M       t/porting/known_pod_issues.dat
M       utf8.c

commit 3c813ed0ab90d1f1f16ca848d265616ae5315536
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 13:21:26 2012 -0600

    perlapi (from sv.h) clarifications

M       sv.h

commit ef9741a5409c0b7bf9f7ff7d92654a26eb435a49
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 13:01:11 2012 -0600

    autodoc.pl: pod format fix

M       autodoc.pl

commit 95701e00f8e747fe1c24564cb038be375695df5a
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 10:52:25 2012 -0600

    perlguts, warnings.t: Update references to obsolete fcn names
    
    These functions were replaced long ago, apparently in 5.8, but I didn't
    verify that for sure.

M       pod/perlguts.pod
M       t/lib/warnings/utf8

commit 8752e5a8bd5a13883a0909d80ec0e205f71f8dbd
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 19 10:44:08 2012 -0600

    perldelta: clarification

M       pod/perldelta.pod
-----------------------------------------------------------------------
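
The bounded-read idea behind the new utf8_to_uvchr_buf() and utf8_to_uvuni_buf() functions can be illustrated outside of Perl. The following is only a minimal standalone sketch, not the code in utf8.c: the name decode_utf8_bounded and its failure convention (return 0, store a length of 0) are assumptions made for this example, and it does not reject overlong forms or surrogates. What it shows is the role of the extra "send" parameter: the decoder checks the remaining length before touching any continuation byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT part of the Perl API: decode one UTF-8 character
 * starting at s without ever reading at or beyond send.  Returns the code
 * point and stores its byte length in *retlen (if non-NULL).  On truncated
 * or malformed input it returns 0 and stores 0.  Overlong forms and
 * surrogates are not rejected in this sketch. */
static uint32_t
decode_utf8_bounded(const unsigned char *s, const unsigned char *send,
                    size_t *retlen)
{
    size_t len, i;
    uint32_t cp;

    if (retlen) *retlen = 0;
    if (s >= send) return 0;                  /* empty input */

    /* The start byte determines how many bytes the character claims. */
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;                            /* invalid start byte */

    /* The check the unbounded functions cannot make: refuse a character
     * whose claimed length would cross the end of the buffer. */
    if ((size_t)(send - s) < len) return 0;

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (retlen) *retlen = len;
    return cp;
}
```

In XS code the corresponding call is the three-argument form used throughout the diff below, e.g. utf8_to_uvchr_buf(s, send, &len), where send points one past the last byte of the buffer.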

Summary of changes:
 autodoc.pl                     |    2 +-
 cygwin/cygwin.c                |    6 +-
 dist/Data-Dumper/Dumper.pm     |    4 +-
 dist/Data-Dumper/Dumper.xs     |   12 +-
 dump.c                         |    4 +-
 embed.fnc                      |    8 +-
 embed.h                        |    4 +
 ext/XS-APItest/APItest.pm      |    2 +-
 ext/XS-APItest/APItest.xs      |    2 +-
 handy.h                        |   27 ++--
 pod/perldelta.pod              |   30 ++++-
 pod/perlguts.pod               |   14 +-
 pod/perlunicode.pod            |    3 +-
 pp.c                           |    6 +-
 proto.h                        |   24 ++++
 regcomp.c                      |   11 +-
 sv.h                           |    6 +-
 symbian/PerlBase.cpp           |    6 +-
 t/lib/warnings/utf8            |    4 +-
 t/porting/known_pod_issues.dat |    2 +-
 toke.c                         |    2 +-
 utf8.c                         |  300 ++++++++++++++++++++++++++-------------
 22 files changed, 321 insertions(+), 158 deletions(-)

diff --git a/autodoc.pl b/autodoc.pl
index 584ee79..3734884 100644
--- a/autodoc.pl
+++ b/autodoc.pl
@@ -382,7 +382,7 @@ perlapi - autogenerated documentation for the perl public API
 X<Perl API> X<API> X<api>
 
 This file contains the documentation of the perl public API generated by
-embed.pl, specifically a listing of functions, macros, flags, and variables
+F<embed.pl>, specifically a listing of functions, macros, flags, and variables
 that may be used by extension writers.  L<At the end|/Undocumented functions>
 is a list of functions which have yet to be documented.  The interfaces of
 those are subject to change without notice.  Any functions not listed here are
diff --git a/cygwin/cygwin.c b/cygwin/cygwin.c
index 9419e83..29ee22e 100644
--- a/cygwin/cygwin.c
+++ b/cygwin/cygwin.c
@@ -176,7 +176,7 @@ utf8_to_wide(const char *buf)
 
     setlocale(LC_CTYPE, "utf-8");
     wbuf = (wchar_t *) safemalloc(wlen);
-    /* utf8_to_uvuni(pathname, wpath) or Encoding::_utf8_to_bytes(sv, "UCS-2BE"); */
+    /* utf8_to_uvuni_buf(pathname, pathname + wlen, wpath) or Encoding::_utf8_to_bytes(sv, "UCS-2BE"); */
     wlen = mbsrtowcs(wbuf, (const char**)&buf, wlen, &mbs);
 
     if (oldlocale) setlocale(LC_CTYPE, oldlocale);
@@ -283,7 +283,7 @@ XS(XS_Cygwin_win_to_posix_path)
            mbstate_t mbs;
             char *oldlocale = setlocale(LC_CTYPE, NULL);
             setlocale(LC_CTYPE, "utf-8");
-           /* utf8_to_uvuni(src_path, wpath) or Encoding::_utf8_to_bytes(sv, "UCS-2BE"); */
+           /* utf8_to_uvuni_buf(src_path, src_path + wlen, wpath) or Encoding::_utf8_to_bytes(sv, "UCS-2BE"); */
            wlen = mbsrtowcs(wpath, (const char**)&src_path, wlen, &mbs);
            if (wlen > 0)
                err = cygwin_conv_path(what, wpath, wbuf, wlen);
@@ -370,7 +370,7 @@ XS(XS_Cygwin_posix_to_win_path)
        setlocale(LC_CTYPE, "utf-8");
        if (!IN_BYTES) {
            mbstate_t mbs;
-           /* utf8_to_uvuni(src_path, wpath) or Encoding::_utf8_to_bytes(sv, "UCS-2BE"); */
+           /* utf8_to_uvuni_buf(src_path, src_path + wlen, wpath) or Encoding::_utf8_to_bytes(sv, "UCS-2BE"); */
            wlen = mbsrtowcs(wpath, (const char**)&src_path, wlen, &mbs);
            if (wlen > 0)
                err = cygwin_conv_path(what, wpath, wbuf, wlen);
diff --git a/dist/Data-Dumper/Dumper.pm b/dist/Data-Dumper/Dumper.pm
index 5cff100..a099277 100644
--- a/dist/Data-Dumper/Dumper.pm
+++ b/dist/Data-Dumper/Dumper.pm
@@ -10,7 +10,7 @@
 package Data::Dumper;
 
 BEGIN {
-    $VERSION = '2.135_05'; # Don't forget to set version and release
+    $VERSION = '2.135_06'; # Don't forget to set version and release
 }                         # date in POD!
 
 #$| = 1;
@@ -1332,7 +1332,7 @@ modify it under the same terms as Perl itself.
 
 =head1 VERSION
 
-Version 2.135_05  (February 18 2012)
+Version 2.135_06  (March 20 2012)
 
 =head1 SEE ALSO
 
diff --git a/dist/Data-Dumper/Dumper.xs b/dist/Data-Dumper/Dumper.xs
index 4b7af7c..91e4c6c 100644
--- a/dist/Data-Dumper/Dumper.xs
+++ b/dist/Data-Dumper/Dumper.xs
@@ -37,17 +37,17 @@ static I32 DD_dump (pTHX_ SV *val, const char *name, STRLEN namelen, SV *retval,
 # endif
 
 UV
-Perl_utf8_to_uvchr(pTHX_ U8 *s, STRLEN *retlen)
+Perl_utf8_to_uvchr_buf(pTHX_ U8 *s, U8 *send, STRLEN *retlen)
 {
-    const UV uv = utf8_to_uv(s, UTF8_MAXLEN, retlen,
+    const UV uv = utf8_to_uv(s, send - s, retlen,
                     ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);
     return UNI_TO_NATIVE(uv);
 }
 
 # if !defined(PERL_IMPLICIT_CONTEXT)
-#  define utf8_to_uvchr             Perl_utf8_to_uvchr
+#  define utf8_to_uvchr_buf         Perl_utf8_to_uvchr_buf
 # else
-#  define utf8_to_uvchr(a,b) Perl_utf8_to_uvchr(aTHX_ a,b)
+#  define utf8_to_uvchr_buf(a,b,c) Perl_utf8_to_uvchr_buf(aTHX_ a,b,c)
 # endif
 
 #endif /* PERL_VERSION <= 6 */
@@ -147,7 +147,7 @@ esc_q_utf8(pTHX_ SV* sv, register const char *src, register STRLEN slen)
 
     /* this will need EBCDICification */
     for (s = src; s < send; s += increment) {
-        const UV k = utf8_to_uvchr((U8*)s, NULL);
+        const UV k = utf8_to_uvchr_buf((U8*)s, (U8*) send, NULL);
 
         /* check for invalid utf8 */
         increment = (k == 0 && *s != '\0') ? 1 : UTF8SKIP(s);
@@ -184,7 +184,7 @@ esc_q_utf8(pTHX_ SV* sv, register const char *src, register STRLEN slen)
         *r++ = '"';
 
         for (s = src; s < send; s += UTF8SKIP(s)) {
-            const UV k = utf8_to_uvchr((U8*)s, NULL);
+            const UV k = utf8_to_uvchr_buf((U8*)s, (U8*) send, NULL);
 
             if (k == '"' || k == '\\' || k == '$' || k == '@') {
                 *r++ = '\\';
diff --git a/dump.c b/dump.c
index 2c635de..b238ee0 100644
--- a/dump.c
+++ b/dump.c
@@ -281,7 +281,7 @@ Perl_pv_escape( pTHX_ SV *dsv, char const * const str,
         isuni = 1;
     
     for ( ; (pv < end && (!max || (wrote < max))) ; pv += readsize ) {
-        const UV u= (isuni) ? utf8_to_uvchr((U8*)pv, &readsize) : (U8)*pv;            
+        const UV u= (isuni) ? utf8_to_uvchr_buf((U8*)pv, (U8*) end, &readsize) : (U8)*pv;
         const U8 c = (U8)u & 0xFF;
         
         if ( ( u > 255 )
@@ -2420,7 +2420,7 @@ Perl_sv_catxmlpvn(pTHX_ SV *dsv, const char *pv, STRLEN len, int utf8)
   retry:
     while (pv < e) {
        if (utf8) {
-           c = utf8_to_uvchr((U8*)pv, &cl);
+           c = utf8_to_uvchr_buf((U8*)pv, (U8*)e, &cl);
            if (cl == 0) {
                SvCUR(dsv) = dsvcur;
                pv = start;
diff --git a/embed.fnc b/embed.fnc
index c549dc9..f9d214d 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1448,8 +1448,12 @@ Apd      |int    |bytes_cmp_utf8 |NN const U8 *b|STRLEN blen|NN const U8 *u \
                                |STRLEN ulen
 ApMd   |U8*    |bytes_from_utf8|NN const U8 *s|NN STRLEN *len|NULLOK bool *is_utf8
 ApMd   |U8*    |bytes_to_utf8  |NN const U8 *s|NN STRLEN *len
-Apd    |UV     |utf8_to_uvchr  |NN const U8 *s|NULLOK STRLEN *retlen
-Apd    |UV     |utf8_to_uvuni  |NN const U8 *s|NULLOK STRLEN *retlen
+ApdD   |UV     |utf8_to_uvchr  |NN const U8 *s|NULLOK STRLEN *retlen
+ApdD   |UV     |utf8_to_uvuni  |NN const U8 *s|NULLOK STRLEN *retlen
+ApdM   |UV     |valid_utf8_to_uvchr    |NN const U8 *s|NULLOK STRLEN *retlen
+ApdM   |UV     |valid_utf8_to_uvuni    |NN const U8 *s|NULLOK STRLEN *retlen
+Apd    |UV     |utf8_to_uvchr_buf      |NN const U8 *s|NN const U8 *send|NULLOK STRLEN *retlen
+Apd    |UV     |utf8_to_uvuni_buf      |NN const U8 *s|NN const U8 *send|NULLOK STRLEN *retlen
 pM     |bool   |check_utf8_print       |NN const U8 *s|const STRLEN len
 
 #ifdef EBCDIC
diff --git a/embed.h b/embed.h
index 1d1e598..31e024c 100644
--- a/embed.h
+++ b/embed.h
@@ -672,10 +672,14 @@
 #define utf8_length(a,b)       Perl_utf8_length(aTHX_ a,b)
 #define utf8_to_bytes(a,b)     Perl_utf8_to_bytes(aTHX_ a,b)
 #define utf8_to_uvchr(a,b)     Perl_utf8_to_uvchr(aTHX_ a,b)
+#define utf8_to_uvchr_buf(a,b,c)       Perl_utf8_to_uvchr_buf(aTHX_ a,b,c)
 #define utf8_to_uvuni(a,b)     Perl_utf8_to_uvuni(aTHX_ a,b)
+#define utf8_to_uvuni_buf(a,b,c)       Perl_utf8_to_uvuni_buf(aTHX_ a,b,c)
 #define utf8n_to_uvuni(a,b,c,d)        Perl_utf8n_to_uvuni(aTHX_ a,b,c,d)
 #define uvchr_to_utf8_flags(a,b,c)     Perl_uvchr_to_utf8_flags(aTHX_ a,b,c)
 #define uvuni_to_utf8_flags(a,b,c)     Perl_uvuni_to_utf8_flags(aTHX_ a,b,c)
+#define valid_utf8_to_uvchr(a,b)       Perl_valid_utf8_to_uvchr(aTHX_ a,b)
+#define valid_utf8_to_uvuni(a,b)       Perl_valid_utf8_to_uvuni(aTHX_ a,b)
 #define vcmp(a,b)              Perl_vcmp(aTHX_ a,b)
 #define vcroak(a,b)            Perl_vcroak(aTHX_ a,b)
 #define vdeb(a,b)              Perl_vdeb(aTHX_ a,b)
diff --git a/ext/XS-APItest/APItest.pm b/ext/XS-APItest/APItest.pm
index 7e7e8de..78d77f1 100644
--- a/ext/XS-APItest/APItest.pm
+++ b/ext/XS-APItest/APItest.pm
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use Carp;
 
-our $VERSION = '0.36';
+our $VERSION = '0.37';
 
 require XSLoader;
 
diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs
index 6e8689c..5105960 100644
--- a/ext/XS-APItest/APItest.xs
+++ b/ext/XS-APItest/APItest.xs
@@ -148,7 +148,7 @@ bitflip_key(pTHX_ IV action, SV *field) {
                const char *const end = p + len;
                while (p < end) {
                    STRLEN len;
-                   UV chr = utf8_to_uvuni((U8 *)p, &len);
+                   UV chr = utf8_to_uvuni_buf((U8 *)p, (U8 *) end, &len);
                    new_p = (char *)uvuni_to_utf8((U8 *)new_p, chr ^ 32);
                    p += len;
                }
diff --git a/handy.h b/handy.h
index c437447..c90a876 100644
--- a/handy.h
+++ b/handy.h
@@ -949,7 +949,8 @@ EXTCONST U32 PL_charclass[];
                                                                    *((p)+1)))  \
                                            : function(p))
 
-/* Note that all ignore 'use bytes' */
+/* Note that all assume that the utf8 has been validated, and ignore 'use
+ * bytes' */
 
 #define isALNUM_utf8(p)                generic_utf8(isWORDCHAR, is_utf8_alnum, p)
 /* To prevent S_scan_word in toke.c from hanging, we have to make sure that
@@ -992,18 +993,18 @@ EXTCONST U32 PL_charclass[];
                                   : isSPACE_utf8(p)))
 #define isBLANK_utf8(c)                isBLANK(c) /* could be wrong */
 
-#define isALNUM_LC_utf8(p)     isALNUM_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isIDFIRST_LC_utf8(p)   isIDFIRST_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isALPHA_LC_utf8(p)     isALPHA_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isSPACE_LC_utf8(p)     isSPACE_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isDIGIT_LC_utf8(p)     isDIGIT_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isUPPER_LC_utf8(p)     isUPPER_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isLOWER_LC_utf8(p)     isLOWER_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isALNUMC_LC_utf8(p)    isALNUMC_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isCNTRL_LC_utf8(p)     isCNTRL_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isGRAPH_LC_utf8(p)     isGRAPH_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isPRINT_LC_utf8(p)     isPRINT_LC_uvchr(utf8_to_uvchr(p,  0))
-#define isPUNCT_LC_utf8(p)     isPUNCT_LC_uvchr(utf8_to_uvchr(p,  0))
+#define isALNUM_LC_utf8(p)     isALNUM_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isIDFIRST_LC_utf8(p)   isIDFIRST_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isALPHA_LC_utf8(p)     isALPHA_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isSPACE_LC_utf8(p)     isSPACE_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isDIGIT_LC_utf8(p)     isDIGIT_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isUPPER_LC_utf8(p)     isUPPER_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isLOWER_LC_utf8(p)     isLOWER_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isALNUMC_LC_utf8(p)    isALNUMC_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isCNTRL_LC_utf8(p)     isCNTRL_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isGRAPH_LC_utf8(p)     isGRAPH_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isPRINT_LC_utf8(p)     isPRINT_LC_uvchr(valid_utf8_to_uvchr(p,  0))
+#define isPUNCT_LC_utf8(p)     isPUNCT_LC_uvchr(valid_utf8_to_uvchr(p,  0))
 
 #define isPSXSPC_LC_utf8(c)    (isSPACE_LC_utf8(c) ||(c) == '\f')
 #define isBLANK_LC_utf8(c)     isBLANK(c) /* could be wrong */
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 1f2f16f..b1c96c8 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -28,7 +28,12 @@ write C<< no feature ':all' >>.
 
 =head1 Security
 
-There have been no security related fixed between 5.15.8 and 5.15.9.
+=head2 Malformed UTF-8 input could cause attempts to read beyond the end of the buffer
+
+Two new XS-accessible functions, C<utf8_to_uvchr_buf()> and
+C<utf8_to_uvuni_buf()> are now available to prevent this, and the Perl
+core has been converted to use them.
+See L</Internal Changes>.
 
 =head1 Incompatible Changes
 
@@ -44,6 +49,11 @@ It has been documented that the current plans include requiring a
 literal C<< "{" >> to be escaped: 5.18 will emit deprecation warnings,
 and it will be required in 5.20.
 
+=head2 XS functions C<utf8_to_uvchr()> and C<utf8_to_uvuni()>
+
+Use C<utf8_to_uvchr_buf()> and C<utf8_to_uvuni_buf()> instead.
+See L</Internal Changes>.
+
 =head1 Performance Enhancements
 
 =over 4
@@ -176,8 +186,18 @@ There have been no changes to Perl's support of various platforms between
 
 =head1 Internal Changes
 
-There has been no change that affects the interface available to C<< XS >>
-between 5.15.8 and 5.15.9.
+=over 4
+
+=item *
+
+Two new functions C<utf8_to_uvchr_buf()> and C<utf8_to_uvuni_buf()> have
+been added.  These are the same as C<utf8_to_uvchr> and
+C<utf8_to_uvuni> (which are now deprecated), but take an extra parameter
+that is used to guard against reading beyond the end of the input
+string.
+See L<perlapi/utf8_to_uvchr_buf> and L<perlapi/utf8_to_uvuni_buf>.
+
+=back
 
 =head1 Selected Bug Fixes
 
@@ -185,8 +205,8 @@ between 5.15.8 and 5.15.9.
 
 =item *
 
-I<Takri> is now considered a script that uses two characters. This corrects
-a Unicode 6.1 omission.
+I<Takri> now matches two more characters under the C<Script_Extensions>
+property. This corrects a Unicode 6.1 omission.
 
 =item *
 
diff --git a/pod/perlguts.pod b/pod/perlguts.pod
index e585171..908fa1f 100644
--- a/pod/perlguts.pod
+++ b/pod/perlguts.pod
@@ -2670,22 +2670,24 @@ character like this (the UTF8_IS_INVARIANT() is a macro that tests
 whether the byte can be encoded as a single byte even in UTF-8):
 
     U8 *utf;
+    U8 *utf_end; /* 1 beyond buffer pointed to by utf */
     UV uv;     /* Note: a UV, not a U8, not a char */
+    STRLEN len; /* length of character in bytes */
 
     if (!UTF8_IS_INVARIANT(*utf))
         /* Must treat this as UTF-8 */
-        uv = utf8_to_uv(utf);
+        uv = utf8_to_uvchr_buf(utf, utf_end, &len);
     else
         /* OK to treat this character as a byte */
         uv = *utf;
 
-You can also see in that example that we use C<utf8_to_uv> to get the
-value of the character; the inverse function C<uv_to_utf8> is available
+You can also see in that example that we use C<utf8_to_uvchr_buf> to get the
+value of the character; the inverse function C<uvchr_to_utf8> is available
 for putting a UV into UTF-8:
 
     if (!UTF8_IS_INVARIANT(uv))
         /* Must treat this as UTF8 */
-        utf8 = uv_to_utf8(utf8, uv);
+        utf8 = uvchr_to_utf8(utf8, uv);
     else
         /* OK to treat this character as a byte */
         *utf8++ = uv;
@@ -2791,13 +2793,13 @@ it's not - if you pass on the PV to somewhere, pass on the flag too.
 
 =item *
 
-If a string is UTF-8, B<always> use C<utf8_to_uv> to get at the value,
+If a string is UTF-8, B<always> use C<utf8_to_uvchr_buf> to get at the value,
 unless C<UTF8_IS_INVARIANT(*s)> in which case you can use C<*s>.
 
 =item *
 
 When writing a character C<uv> to a UTF-8 string, B<always> use
-C<uv_to_utf8>, unless C<UTF8_IS_INVARIANT(uv))> in which case
+C<uvchr_to_utf8>, unless C<UTF8_IS_INVARIANT(uv))> in which case
 you can use C<*s = uv>.
 
 =item *
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index b96efbf..74c1666 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1514,7 +1514,8 @@ pointing after the UTF-8 bytes.  It works appropriately on EBCDIC machines.
 
 =item *
 
-C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
+C<utf8_to_uvchr_buf(buf, bufend, lenp)> reads UTF-8 encoded bytes from a
+buffer and
 returns the Unicode character code point and, optionally, the length of
 the UTF-8 byte sequence.  It works appropriately on EBCDIC machines.
 
diff --git a/pp.c b/pp.c
index f3c4ebb..ba3ac1f 100644
--- a/pp.c
+++ b/pp.c
@@ -3383,7 +3383,7 @@ PP(pp_chr)
         sv_recode_to_utf8(TARG, PL_encoding);
        tmps = SvPVX(TARG);
        if (SvCUR(TARG) == 0 || !is_utf8_string((U8*)tmps, SvCUR(TARG)) ||
-           UNICODE_IS_REPLACEMENT(utf8_to_uvchr((U8*)tmps, NULL))) {
+           UNICODE_IS_REPLACEMENT(utf8_to_uvchr_buf((U8*)tmps, (U8*) tmps + SvCUR(TARG), NULL))) {
            SvGROW(TARG, 2);
            tmps = SvPVX(TARG);
            SvCUR_set(TARG, 1);
@@ -3795,7 +3795,7 @@ PP(pp_uc)
             uv = _to_utf8_upper_flags(s, tmpbuf, &ulen,
                                      cBOOL(IN_LOCALE_RUNTIME), &tainted);
             if (uv == GREEK_CAPITAL_LETTER_IOTA
-                && utf8_to_uvchr(s, 0) == COMBINING_GREEK_YPOGEGRAMMENI)
+                && utf8_to_uvchr_buf(s, send, 0) == COMBINING_GREEK_YPOGEGRAMMENI)
             {
                 in_iota_subscript = TRUE;
             }
@@ -5344,7 +5344,7 @@ PP(pp_reverse)
                        continue;
                    }
                    else {
-                       if (!utf8_to_uvchr(s, 0))
+                       if (!utf8_to_uvchr_buf(s, send, 0))
                            break;
                        up = (char*)s;
                        s += UTF8SKIP(s);
diff --git a/proto.h b/proto.h
index b811e6b..d8978c6 100644
--- a/proto.h
+++ b/proto.h
@@ -4564,15 +4564,29 @@ PERL_CALLCONV U8*       Perl_utf8_to_bytes(pTHX_ U8 *s, STRLEN *len)
        assert(s); assert(len)
 
 PERL_CALLCONV UV       Perl_utf8_to_uvchr(pTHX_ const U8 *s, STRLEN *retlen)
+                       __attribute__deprecated__
                        __attribute__nonnull__(pTHX_1);
 #define PERL_ARGS_ASSERT_UTF8_TO_UVCHR \
        assert(s)
 
+PERL_CALLCONV UV       Perl_utf8_to_uvchr_buf(pTHX_ const U8 *s, const U8 *send, STRLEN *retlen)
+                       __attribute__nonnull__(pTHX_1)
+                       __attribute__nonnull__(pTHX_2);
+#define PERL_ARGS_ASSERT_UTF8_TO_UVCHR_BUF     \
+       assert(s); assert(send)
+
 PERL_CALLCONV UV       Perl_utf8_to_uvuni(pTHX_ const U8 *s, STRLEN *retlen)
+                       __attribute__deprecated__
                        __attribute__nonnull__(pTHX_1);
 #define PERL_ARGS_ASSERT_UTF8_TO_UVUNI \
        assert(s)
 
+PERL_CALLCONV UV       Perl_utf8_to_uvuni_buf(pTHX_ const U8 *s, const U8 *send, STRLEN *retlen)
+                       __attribute__nonnull__(pTHX_1)
+                       __attribute__nonnull__(pTHX_2);
+#define PERL_ARGS_ASSERT_UTF8_TO_UVUNI_BUF     \
+       assert(s); assert(send)
+
 PERL_CALLCONV UV       Perl_utf8n_to_uvuni(pTHX_ const U8 *s, STRLEN curlen, STRLEN *retlen, U32 flags)
                        __attribute__nonnull__(pTHX_1);
 #define PERL_ARGS_ASSERT_UTF8N_TO_UVUNI        \
@@ -4593,6 +4607,16 @@ PERL_CALLCONV U8*        Perl_uvuni_to_utf8_flags(pTHX_ U8 *d, UV uv, UV flags)
 #define PERL_ARGS_ASSERT_UVUNI_TO_UTF8_FLAGS   \
        assert(d)
 
+PERL_CALLCONV UV       Perl_valid_utf8_to_uvchr(pTHX_ const U8 *s, STRLEN *retlen)
+                       __attribute__nonnull__(pTHX_1);
+#define PERL_ARGS_ASSERT_VALID_UTF8_TO_UVCHR   \
+       assert(s)
+
+PERL_CALLCONV UV       Perl_valid_utf8_to_uvuni(pTHX_ const U8 *s, STRLEN *retlen)
+                       __attribute__nonnull__(pTHX_1);
+#define PERL_ARGS_ASSERT_VALID_UTF8_TO_UVUNI   \
+       assert(s)
+
 PERL_CALLCONV int      Perl_vcmp(pTHX_ SV *lhv, SV *rhv)
                        __attribute__nonnull__(pTHX_1)
                        __attribute__nonnull__(pTHX_2);
diff --git a/regcomp.c b/regcomp.c
index e3da6e9..8c287bf 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3480,8 +3480,8 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
            UV uc;
            if (UTF) {
                const U8 * const s = (U8*)STRING(scan);
+               uc = utf8_to_uvchr_buf(s, s + l, NULL);
                l = utf8_length(s, s + l);
-               uc = utf8_to_uvchr(s, NULL);
            } else {
                uc = *((U8*)STRING(scan));
            }
@@ -3575,8 +3575,8 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
            }
            if (UTF) {
                const U8 * const s = (U8 *)STRING(scan);
+               uc = utf8_to_uvchr_buf(s, s + l, NULL);
                l = utf8_length(s, s + l);
-               uc = utf8_to_uvchr(s, NULL);
            }
            else if (has_exactf_sharp_s) {
                RExC_seen |= REG_SEEN_EXACTF_SHARP_S;
@@ -9822,7 +9822,10 @@ tryagain:
                              for (foldbuf = tmpbuf;
                                   foldlen;
                                   foldlen -= numlen) {
-                                  ender = utf8_to_uvchr(foldbuf, &numlen);
+
+                                  /* tmpbuf has been constructed by us, so we
+                                   * know it is valid utf8 */
+                                  ender = valid_utf8_to_uvchr(foldbuf, &numlen);
                                   if (numlen > 0) {
                                        const STRLEN unilen = reguni(pRExC_state, ender, s);
                                        s       += unilen;
@@ -9858,7 +9861,7 @@ tryagain:
                          for (foldbuf = tmpbuf;
                               foldlen;
                               foldlen -= numlen) {
-                              ender = utf8_to_uvchr(foldbuf, &numlen);
+                              ender = valid_utf8_to_uvchr(foldbuf, &numlen);
                               if (numlen > 0) {
                                    const STRLEN unilen = reguni(pRExC_state, ender, s);
                                    len     += unilen;
diff --git a/sv.h b/sv.h
index 60ff740..7f79c01 100644
--- a/sv.h
+++ b/sv.h
@@ -806,7 +806,8 @@ Set the actual length of the string which is in the SV.  See C<SvIV_set>.
 
 /*
 =for apidoc Am|U32|SvUTF8|SV* sv
-Returns a U32 value indicating whether the SV contains UTF-8 encoded data.
+Returns a U32 value indicating the UTF-8 status of an SV.  If things are set-up
+properly, this indicates whether or not the SV contains UTF-8 encoded data.
 Call this after SvPV() in case any call to string overloading updates the
 internal flag.
 
@@ -815,7 +816,8 @@ Turn on the UTF-8 status of an SV (the data is not changed, just the flag).
 Do not use frivolously.
 
 =for apidoc Am|void|SvUTF8_off|SV *sv
-Unsets the UTF-8 status of an SV.
+Unsets the UTF-8 status of an SV (the data is not changed, just the flag).
+Do not use frivolously.
 
 =for apidoc Am|void|SvPOK_only_UTF8|SV* sv
 Tells an SV that it is a string and disables all other OK bits,
diff --git a/symbian/PerlBase.cpp b/symbian/PerlBase.cpp
index 4162e57..9312abe 100644
--- a/symbian/PerlBase.cpp
+++ b/symbian/PerlBase.cpp
@@ -364,7 +364,9 @@ int CPerlBase::ConsoleRead(const int fd, char* buf, int n)
 #else
     dTHX;
     for (i = 0; i < nUtf8; i+= UTF8SKIP(pUtf8 + i)) {
-        unsigned long u = utf8_to_uvchr((U8*)(pUtf8 + i), 0);
+        unsigned long u = utf8_to_uvchr_buf((U8*)(pUtf8 + i),
+                                            (U8*)(pUtf8 + nUtf8),
+                                            0);
         if (u > 0xFF) {
             iConsole->Printf(_L("(keycode > 0xFF)\n"));
             buf[i] = 0;
@@ -401,7 +403,7 @@ int CPerlBase::ConsoleWrite(const int fd, const char* buf, int n)
     dTHX;
     if (is_utf8_string((U8*)buf, n)) {
         for (int i = 0; i < n; i += UTF8SKIP(buf + i)) {
-            TChar u = utf8_to_uvchr((U8*)(buf + i), 0);
+            TChar u = valid_utf8_to_uvchr((U8*)(buf + i), 0);
             iConsole->Printf(_L("%c"), u);
             wrote++;
         }
diff --git a/t/lib/warnings/utf8 b/t/lib/warnings/utf8
index 7d3886c..f6fa8f2 100644
--- a/t/lib/warnings/utf8
+++ b/t/lib/warnings/utf8
@@ -1,7 +1,7 @@
 
   utf8.c AOK
 
-     [utf8_to_uv]
+     [utf8_to_uvchr_buf]
      Malformed UTF-8 character
        my $a = ord "\x80" ;
 
@@ -14,7 +14,7 @@
      <<<<<< Add a test when something actually calls utf16_to_utf8
 
 __END__
-# utf8.c [utf8_to_uv] -W
+# utf8.c [utf8_to_uvchr_buf] -W
 BEGIN {
     if (ord('A') == 193) {
         print "SKIPPED\n# ebcdic platforms do not generate Malformed UTF-8 warnings.";
diff --git a/t/porting/known_pod_issues.dat b/t/porting/known_pod_issues.dat
index ed33802..4779e23 100644
--- a/t/porting/known_pod_issues.dat
+++ b/t/porting/known_pod_issues.dat
@@ -204,7 +204,7 @@ os2/os2/os2-rexx/dll/dll.pm Verbatim line length including indents exceeds 79 by
 os2/os2/os2-rexx/rexx.pm       Verbatim line length including indents exceeds 79 by    1
 pod/perl.pod   Verbatim line length including indents exceeds 79 by    9
 pod/perlaix.pod        Verbatim line length including indents exceeds 79 by    11
-pod/perlapi.pod        ? Should you be using L<...> instead of 86
+pod/perlapi.pod        ? Should you be using L<...> instead of 85
 pod/perlapi.pod        Verbatim line length including indents exceeds 79 by    6
 pod/perlapi.pod        unresolved internal link        3
 pod/perlapio.pod       Verbatim line length including indents exceeds 79 by    5
diff --git a/toke.c b/toke.c
index 829ff86..58142ab 100644
--- a/toke.c
+++ b/toke.c
@@ -9883,7 +9883,7 @@ S_scan_str(pTHX_ char *start, int keep_quoted, int keep_delims)
        termlen = 1;
     }
     else {
-       termcode = utf8_to_uvchr((U8*)s, &termlen);
+       termcode = utf8_to_uvchr_buf((U8*)s, (U8*)PL_bufend, &termlen);
        Copy(s, termstr, termlen, U8);
        if (!UTF8_IS_INVARIANT(term))
            has_utf8 = TRUE;
diff --git a/utf8.c b/utf8.c
index 2b1e99b..1d646a8 100644
--- a/utf8.c
+++ b/utf8.c
@@ -57,14 +57,14 @@ within non-zero characters.
 /*
 =for apidoc is_ascii_string
 
-Returns true if the first C<len> bytes of the given string are the same whether
+Returns true if the first C<len> bytes of the string C<s> are the same whether
 or not the string is encoded in UTF-8 (or UTF-EBCDIC on EBCDIC machines).  That
 is, if they are invariant.  On ASCII-ish machines, only ASCII characters
 fit this definition, hence the function's name.
 
 If C<len> is 0, it will be calculated using C<strlen(s)>.  
 
-See also is_utf8_string(), is_utf8_string_loclen(), and is_utf8_string_loc().
+See also L</is_utf8_string>(), L</is_utf8_string_loclen>(), and L</is_utf8_string_loc>().
 
 =cut
 */
@@ -109,7 +109,8 @@ This is the recommended Unicode-aware way of saying
 
 This function will convert to UTF-8 (and not warn) even code points that aren't
 legal Unicode or are problematic, unless C<flags> contains one or more of the
-following flags.
+following flags:
+
 If C<uv> is a Unicode surrogate code point and UNICODE_WARN_SURROGATE is set,
 the function will raise a warning, provided UTF8 warnings are enabled.  If instead
 UNICODE_DISALLOW_SURROGATE is set, the function will fail and return NULL.
@@ -363,7 +364,7 @@ character is a valid UTF-8 character.  The actual number of bytes in the UTF-8
 character will be returned if it is valid, otherwise 0.
 
 This function is deprecated due to the possibility that malformed input could
-cause reading beyond the end of the input buffer.  Use C<is_utf8_char_buf>
+cause reading beyond the end of the input buffer.  Use L</is_utf8_char_buf>
 instead.
 
 =cut */
@@ -381,13 +382,13 @@ Perl_is_utf8_char(const U8 *s)
 /*
 =for apidoc is_utf8_string
 
-Returns true if first C<len> bytes of the given string form a valid
+Returns true if the first C<len> bytes of string C<s> form a valid
 UTF-8 string, false otherwise.  If C<len> is 0, it will be calculated
 using C<strlen(s)> (which means if you use this option, that C<s> has to have a
 terminating NUL byte).  Note that all characters being ASCII constitute 'a
 valid UTF-8 string'.
 
-See also is_ascii_string(), is_utf8_string_loclen(), and is_utf8_string_loc().
+See also L</is_ascii_string>(), L</is_utf8_string_loclen>(), and L</is_utf8_string_loc>().
 
 =cut
 */
@@ -435,20 +436,20 @@ Implemented as a macro in utf8.h
 
 =for apidoc is_utf8_string_loc
 
-Like is_utf8_string() but stores the location of the failure (in the
-case of "utf8ness failure") or the location s+len (in the case of
+Like L</is_utf8_string> but stores the location of the failure (in the
+case of "utf8ness failure") or the location C<s>+C<len> (in the case of
 "utf8ness success") in the C<ep>.
 
-See also is_utf8_string_loclen() and is_utf8_string().
+See also L</is_utf8_string_loclen>() and L</is_utf8_string>().
 
 =for apidoc is_utf8_string_loclen
 
-Like is_utf8_string() but stores the location of the failure (in the
-case of "utf8ness failure") or the location s+len (in the case of
+Like L</is_utf8_string>() but stores the location of the failure (in the
+case of "utf8ness failure") or the location C<s>+C<len> (in the case of
 "utf8ness success") in the C<ep>, and the number of UTF-8
 encoded characters in the C<el>.
 
-See also is_utf8_string_loc() and is_utf8_string().
+See also L</is_utf8_string_loc>() and L</is_utf8_string>().
 
 =cut
 */
@@ -548,7 +549,8 @@ UTF8_CHECK_ONLY is also specified.)
 
 Very large code points (above 0x7FFF_FFFF) are considered more problematic than
 the others that are above the Unicode legal maximum.  There are several
-reasons, one of which is that the original UTF-8 specification never went above
+reasons: they do not fit into a 32-bit word, are not representable on EBCDIC
+platforms, and the original UTF-8 specification never went above
 this number (the current 0x10FFF limit was imposed later).  The UTF-8 encoding
 on ASCII platforms for these large code points begins with a byte containing
 0xFE or 0xFF.  The UTF8_DISALLOW_FE_FF flag will cause them to be treated as
@@ -561,7 +563,7 @@ All other code points corresponding to Unicode characters, including private
 use and those yet to be assigned, are never considered malformed and never
 warn.
 
-Most code should use utf8_to_uvchr() rather than call this directly.
+Most code should use L</utf8_to_uvchr_buf>() rather than call this directly.
 
 =cut
 */
@@ -793,40 +795,126 @@ malformed:
 }
 
 /*
+=for apidoc utf8_to_uvchr_buf
+
+Returns the native code point of the first character in the string C<s> which
+is assumed to be in UTF-8 encoding; C<send> points to 1 beyond the end of C<s>.
+C<retlen> will be set to the length, in bytes, of that character.
+
+If C<s> does not point to a well-formed UTF-8 character, zero is
+returned and C<retlen> is set, if possible, to -1.
+
+=cut
+*/
+
+
+UV
+Perl_utf8_to_uvchr_buf(pTHX_ const U8 *s, const U8 *send, STRLEN *retlen)
+{
+    PERL_ARGS_ASSERT_UTF8_TO_UVCHR_BUF;
+
+    assert(s < send);
+
+    return utf8n_to_uvchr(s, send - s, retlen,
+                         ckWARN_d(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);
+}
+
+/* Like L</utf8_to_uvchr_buf>(), but should only be called when it is known that
+ * there are no malformations in the input UTF-8 string C<s>.  Currently, some
+ * malformations are checked for, but this checking likely will be removed in
+ * the future */
+
+UV
+Perl_valid_utf8_to_uvchr(pTHX_ const U8 *s, STRLEN *retlen)
+{
+    PERL_ARGS_ASSERT_VALID_UTF8_TO_UVCHR;
+
+    return utf8_to_uvchr_buf(s, s + UTF8_MAXBYTES, retlen);
+}
+
+/*
 =for apidoc utf8_to_uvchr
 
+DEPRECATED!
+
 Returns the native code point of the first character in the string C<s>
 which is assumed to be in UTF-8 encoding; C<retlen> will be set to the
 length, in bytes, of that character.
 
-If C<s> does not point to a well-formed UTF-8 character, zero is
-returned and retlen is set, if possible, to -1.
+Some, but not all, UTF-8 malformations are detected, and in fact, some
+malformed input could cause reading beyond the end of the input buffer, which
+is why this function is deprecated.  Use L</utf8_to_uvchr_buf> instead.
+
+If C<s> points to one of the detected malformations, zero is
+returned and C<retlen> is set, if possible, to -1.
 
 =cut
 */
 
-
 UV
 Perl_utf8_to_uvchr(pTHX_ const U8 *s, STRLEN *retlen)
 {
     PERL_ARGS_ASSERT_UTF8_TO_UVCHR;
 
-    return utf8n_to_uvchr(s, UTF8_MAXBYTES, retlen,
-                         ckWARN_d(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);
+    return valid_utf8_to_uvchr(s, retlen);
+}
+
+/*
+=for apidoc utf8_to_uvuni_buf
+
+Returns the Unicode code point of the first character in the string C<s> which
+is assumed to be in UTF-8 encoding; C<send> points to 1 beyond the end of C<s>.
+C<retlen> will be set to the length, in bytes, of that character.
+
+This function should only be used when the returned UV is considered
+an index into the Unicode semantic tables (e.g. swashes).
+
+If C<s> does not point to a well-formed UTF-8 character, zero is
+returned and C<retlen> is set, if possible, to -1.
+
+=cut
+*/
+
+UV
+Perl_utf8_to_uvuni_buf(pTHX_ const U8 *s, const U8 *send, STRLEN *retlen)
+{
+    PERL_ARGS_ASSERT_UTF8_TO_UVUNI_BUF;
+
+    assert(send > s);
+
+    /* Call the low level routine asking for checks */
+    return Perl_utf8n_to_uvuni(aTHX_ s, send -s, retlen,
+                              ckWARN_d(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);
+}
+
+/* Like L</utf8_to_uvuni_buf>(), but should only be called when it is known that
+ * there are no malformations in the input UTF-8 string C<s>.  Currently, some
+ * malformations are checked for, but this checking likely will be removed in
+ * the future */
+
+UV
+Perl_valid_utf8_to_uvuni(pTHX_ const U8 *s, STRLEN *retlen)
+{
+    PERL_ARGS_ASSERT_VALID_UTF8_TO_UVUNI;
+
+    return utf8_to_uvuni_buf(s, s + UTF8_MAXBYTES, retlen);
 }
 
 /*
 =for apidoc utf8_to_uvuni
 
+DEPRECATED!
+
 Returns the Unicode code point of the first character in the string C<s>
 which is assumed to be in UTF-8 encoding; C<retlen> will be set to the
 length, in bytes, of that character.
 
-This function should only be used when the returned UV is considered
-an index into the Unicode semantic tables (e.g. swashes).
+Some, but not all, UTF-8 malformations are detected, and in fact, some
+malformed input could cause reading beyond the end of the input buffer, which
+is why this function is deprecated.  Use L</utf8_to_uvuni_buf> instead.
 
-If C<s> does not point to a well-formed UTF-8 character, zero is
-returned and retlen is set, if possible, to -1.
+If C<s> points to one of the detected malformations, zero is
+returned and C<retlen> is set, if possible, to -1.
 
 =cut
 */
@@ -836,9 +924,7 @@ Perl_utf8_to_uvuni(pTHX_ const U8 *s, STRLEN *retlen)
 {
     PERL_ARGS_ASSERT_UTF8_TO_UVUNI;
 
-    /* Call the low level routine asking for checks */
-    return Perl_utf8n_to_uvuni(aTHX_ s, UTF8_MAXBYTES, retlen,
-                              ckWARN_d(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);
+    return valid_utf8_to_uvuni(s, retlen);
 }
 
 /*
@@ -946,8 +1032,8 @@ Perl_utf8_hop(pTHX_ const U8 *s, I32 off)
 /*
 =for apidoc bytes_cmp_utf8
 
-Compares the sequence of characters (stored as octets) in b, blen with the
-sequence of characters (stored as UTF-8) in u, ulen. Returns 0 if they are
+Compares the sequence of characters (stored as octets) in C<b>, C<blen> with the
+sequence of characters (stored as UTF-8) in C<u>, C<ulen>. Returns 0 if they are
 equal, -1 or -2 if the first string is less than the second string, +1 or +2
 if the first string is greater than the second string.
 
@@ -1015,11 +1101,11 @@ Perl_bytes_cmp_utf8(pTHX_ const U8 *b, STRLEN blen, const U8 *u, STRLEN ulen)
 =for apidoc utf8_to_bytes
 
 Converts a string C<s> of length C<len> from UTF-8 into native byte encoding.
-Unlike C<bytes_to_utf8>, this over-writes the original string, and
-updates len to contain the new length.
+Unlike L</bytes_to_utf8>, this over-writes the original string, and
+updates C<len> to contain the new length.
 Returns zero on failure, setting C<len> to -1.
 
-If you need a copy of the string, see C<bytes_from_utf8>.
+If you need a copy of the string, see L</bytes_from_utf8>.
 
 =cut
 */
@@ -1048,7 +1134,7 @@ Perl_utf8_to_bytes(pTHX_ U8 *s, STRLEN *len)
     d = s = save;
     while (s < send) {
         STRLEN ulen;
-        *d++ = (U8)utf8_to_uvchr(s, &ulen);
+        *d++ = (U8)utf8_to_uvchr_buf(s, send, &ulen);
         s += ulen;
     }
     *d = '\0';
@@ -1060,7 +1146,7 @@ Perl_utf8_to_bytes(pTHX_ U8 *s, STRLEN *len)
 =for apidoc bytes_from_utf8
 
 Converts a string C<s> of length C<len> from UTF-8 into native byte encoding.
-Unlike C<utf8_to_bytes> but like C<bytes_to_utf8>, returns a pointer to
+Unlike L</utf8_to_bytes> but like L</bytes_to_utf8>, returns a pointer to
 the newly-created string, and updates C<len> to contain the new
 length.  Returns the original string if no conversion occurs, C<len>
 is unchanged. Do nothing if C<is_utf8> points to 0. Sets C<is_utf8> to
@@ -1125,7 +1211,7 @@ A NUL character will be written after the end of the string.
 
 If you want to convert to UTF-8 from encodings other than
 the native (Latin1 or EBCDIC),
-see sv_recode_to_utf8().
+see L</sv_recode_to_utf8>().
 
 =cut
 */
@@ -1426,9 +1512,9 @@ Perl_to_uni_upper(pTHX_ UV c, U8* p, STRLEN *lenp)
 {
     dVAR;
 
-    /* Convert the Unicode character whose ordinal is c to its uppercase
-     * version and store that in UTF-8 in p and its length in bytes in lenp.
-     * Note that the p needs to be at least UTF8_MAXBYTES_CASE+1 bytes since
+    /* Convert the Unicode character whose ordinal is <c> to its uppercase
+     * version and store that in UTF-8 in <p> and its length in bytes in <lenp>.
+     * Note that the <p> needs to be at least UTF8_MAXBYTES_CASE+1 bytes since
      * the changed version may be longer than the original character.
      *
      * The ordinal of the first character of the changed version is returned
@@ -1464,7 +1550,7 @@ S_to_lower_latin1(pTHX_ const U8 c, U8* p, STRLEN *lenp)
 {
     /* We have the latin1-range values compiled into the core, so just use
      * those, converting the result to utf8.  Since the result is always just
-     * one character, we allow p to be NULL */
+     * one character, we allow <p> to be NULL */
 
     U8 converted = toLOWER_LATIN1(c);
 
@@ -1500,7 +1586,7 @@ Perl_to_uni_lower(pTHX_ UV c, U8* p, STRLEN *lenp)
 UV
 Perl__to_fold_latin1(pTHX_ const U8 c, U8* p, STRLEN *lenp, const bool flags)
 {
-    /* Corresponds to to_lower_latin1(), flags is TRUE if to use full case
+    /* Corresponds to to_lower_latin1(), <flags> is TRUE if to use full case
      * folding */
 
     UV converted;
@@ -2044,24 +2130,25 @@ Perl__is_utf8_quotemeta(pTHX_ const U8 *p)
 /*
 =for apidoc to_utf8_case
 
-The "p" contains the pointer to the UTF-8 string encoding
-the character that is being converted.
+The C<p> contains the pointer to the UTF-8 string encoding
+the character that is being converted.  This routine assumes that the character
+at C<p> is well-formed.
 
-The "ustrp" is a pointer to the character buffer to put the
-conversion result to.  The "lenp" is a pointer to the length
+The C<ustrp> is a pointer to the character buffer to put the
+conversion result to.  The C<lenp> is a pointer to the length
 of the result.
 
-The "swashp" is a pointer to the swash to use.
+The C<swashp> is a pointer to the swash to use.
 
-Both the special and normal mappings are stored in lib/unicore/To/Foo.pl,
-and loaded by SWASHNEW, using lib/utf8_heavy.pl.  The special (usually,
+Both the special and normal mappings are stored in F<lib/unicore/To/Foo.pl>,
+and loaded by SWASHNEW, using F<lib/utf8_heavy.pl>.  The C<special> (usually,
 but not always, a multicharacter mapping), is tried first.
 
-The "special" is a string like "utf8::ToSpecLower", which means the
+The C<special> is a string like "utf8::ToSpecLower", which means the
 hash %utf8::ToSpecLower.  The access to the hash is through
 Perl_to_utf8_case().
 
-The "normal" is a string like "ToLower" which means the swash
+The C<normal> is a string like "ToLower" which means the swash
 %utf8::ToLower.
 
 =cut */
@@ -2073,7 +2160,7 @@ Perl_to_utf8_case(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp,
     dVAR;
     U8 tmpbuf[UTF8_MAXBYTES_CASE+1];
     STRLEN len = 0;
-    const UV uv0 = utf8_to_uvchr(p, NULL);
+    const UV uv0 = valid_utf8_to_uvchr(p, NULL);
     /* The NATIVE_TO_UNI() and UNI_TO_NATIVE() mappings
      * are necessary in EBCDIC, they are redundant no-ops
      * in ASCII-ish platforms, and hopefully optimized away. */
@@ -2186,7 +2273,8 @@ S_check_locale_boundary_crossing(pTHX_ const U8* const p, const UV result, U8* c
      * contains a character that crosses the 255/256 boundary, disallow the
      * change, and return the original code point.  See L<perlfunc/lc> for why;
      *
-     * p       points to the original string whose case was changed
+     * p       points to the original string whose case was changed; assumed
+     *          by this routine to be well-formed
      * result  the code point of the first character in the changed-case string
      * ustrp   points to the changed-case string (<result> represents its first char)
      * lenp    points to the length of <ustrp> */
@@ -2220,7 +2308,7 @@ S_check_locale_boundary_crossing(pTHX_ const U8* const p, const UV result, U8* c
 bad_crossing:
 
     /* Failed, have to return the original */
-    original = utf8_to_uvchr(p, lenp);
+    original = valid_utf8_to_uvchr(p, lenp);
     Copy(p, ustrp, *lenp, char);
     return original;
 }
@@ -2228,14 +2316,16 @@ bad_crossing:
 /*
 =for apidoc to_utf8_upper
 
-Convert the UTF-8 encoded character at p to its uppercase version and
-store that in UTF-8 in ustrp and its length in bytes in lenp.  Note
+Convert the UTF-8 encoded character at C<p> to its uppercase version and
+store that in UTF-8 in C<ustrp> and its length in bytes in C<lenp>.  Note
 that the ustrp needs to be at least UTF8_MAXBYTES_CASE+1 bytes since
 the uppercase version may be longer than the original character.
 
 The first character of the uppercased version is returned
 (but note, as explained above, that there may be more.)
 
+The character at C<p> is assumed by this routine to be well-formed.
+
 =cut */
 
 /* Not currently externally documented, and subject to change:
@@ -2298,14 +2388,16 @@ Perl__to_utf8_upper_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, const bool
 /*
 =for apidoc to_utf8_title
 
-Convert the UTF-8 encoded character at p to its titlecase version and
-store that in UTF-8 in ustrp and its length in bytes in lenp.  Note
-that the ustrp needs to be at least UTF8_MAXBYTES_CASE+1 bytes since the
+Convert the UTF-8 encoded character at C<p> to its titlecase version and
+store that in UTF-8 in C<ustrp> and its length in bytes in C<lenp>.  Note
+that the C<ustrp> needs to be at least UTF8_MAXBYTES_CASE+1 bytes since the
 titlecase version may be longer than the original character.
 
 The first character of the titlecased version is returned
 (but note, as explained above, that there may be more.)
 
+The character at C<p> is assumed by this routine to be well-formed.
+
 =cut */
 
 /* Not currently externally documented, and subject to change:
@@ -2370,14 +2462,16 @@ Perl__to_utf8_title_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, const bool
 /*
 =for apidoc to_utf8_lower
 
-Convert the UTF-8 encoded character at p to its lowercase version and
-store that in UTF-8 in ustrp and its length in bytes in lenp.  Note
-that the ustrp needs to be at least UTF8_MAXBYTES_CASE+1 bytes since the
+Convert the UTF-8 encoded character at C<p> to its lowercase version and
+store that in UTF-8 in ustrp and its length in bytes in C<lenp>.  Note
+that the C<ustrp> needs to be at least UTF8_MAXBYTES_CASE+1 bytes since the
 lowercase version may be longer than the original character.
 
 The first character of the lowercased version is returned
 (but note, as explained above, that there may be more.)
 
+The character at C<p> is assumed by this routine to be well-formed.
+
 =cut */
 
 /* Not currently externally documented, and subject to change:
@@ -2441,15 +2535,17 @@ Perl__to_utf8_lower_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, const bool
 /*
 =for apidoc to_utf8_fold
 
-Convert the UTF-8 encoded character at p to its foldcase version and
-store that in UTF-8 in ustrp and its length in bytes in lenp.  Note
-that the ustrp needs to be at least UTF8_MAXBYTES_CASE+1 bytes since the
+Convert the UTF-8 encoded character at C<p> to its foldcase version and
+store that in UTF-8 in C<ustrp> and its length in bytes in C<lenp>.  Note
+that the C<ustrp> needs to be at least UTF8_MAXBYTES_CASE+1 bytes since the
 foldcase version may be longer than the original character (up to
 three characters).
 
 The first character of the foldcased version is returned
 (but note, as explained above, that there may be more.)
 
+The character at C<p> is assumed by this routine to be well-formed.
+
 =cut */
 
 /* Not currently externally documented, and subject to change,
@@ -3418,7 +3514,7 @@ Perl__swash_inversion_hash(pTHX_ SV* const swash)
                           "unexpectedly is not a string, flags=%lu",
                           (unsigned long)SvFLAGS(sv_to));
            }
-           /*DEBUG_U(PerlIO_printf(Perl_debug_log, "Found mapping from %"UVXf", First char of to is %"UVXf"\n", utf8_to_uvchr((U8*) char_from, 0), utf8_to_uvchr((U8*) SvPVX(sv_to), 0)));*/
+           /*DEBUG_U(PerlIO_printf(Perl_debug_log, "Found mapping from %"UVXf", First char of to is %"UVXf"\n", valid_utf8_to_uvchr((U8*) char_from, 0), valid_utf8_to_uvchr((U8*) SvPVX(sv_to), 0)));*/
 
            /* Each key in the inverse list is a mapped-to value, and the key's
             * hash value is a list of the strings (each in utf8) that map to
@@ -3485,7 +3581,7 @@ Perl__swash_inversion_hash(pTHX_ SV* const swash)
                        Perl_croak(aTHX_ "panic: hv_store() unexpectedly failed");
                    }
 
-                   /* For debugging: UV u = utf8_to_uvchr((U8*) SvPVX(*entryp), 0);*/
+                   /* For debugging: UV u = valid_utf8_to_uvchr((U8*) SvPVX(*entryp), 0);*/
                    for (j = 0; j <= av_len(from_list); j++) {
                        entryp = av_fetch(from_list, j, FALSE);
                        if (entryp == NULL) {
@@ -3493,9 +3589,11 @@ Perl__swash_inversion_hash(pTHX_ SV* const swash)
                        }
 
                        /* When i==j this adds itself to the list */
-                       av_push(i_list, newSVuv(utf8_to_uvchr(
-                                               (U8*) SvPVX(*entryp), 0)));
-                       /*DEBUG_U(PerlIO_printf(Perl_debug_log, "Adding %"UVXf" to list for %"UVXf"\n", utf8_to_uvchr((U8*) SvPVX(*entryp), 0), u));*/
+                       av_push(i_list, newSVuv(utf8_to_uvchr_buf(
+                                       (U8*) SvPVX(*entryp),
+                                       (U8*) SvPVX(*entryp) + SvCUR(*entryp),
+                                       0)));
+                       /*DEBUG_U(PerlIO_printf(Perl_debug_log, "Adding %"UVXf" to list for %"UVXf"\n", valid_utf8_to_uvchr((U8*) SvPVX(*entryp), 0), u));*/
                    }
                }
            }
@@ -3800,7 +3898,7 @@ C<s>
 which is assumed to be in UTF-8 encoding; C<retlen> will be set to the
 length, in bytes, of that character.
 
-length and flags are the same as utf8n_to_uvuni().
+C<length> and C<flags> are the same as L</utf8n_to_uvuni>().
 
 =cut
 */
@@ -3841,7 +3939,7 @@ Perl_check_utf8_print(pTHX_ register const U8* s, const STRLEN len)
            STRLEN char_len;
            if (UTF8_IS_SUPER(s)) {
                if (ckWARN_d(WARN_NON_UNICODE)) {
-                   UV uv = utf8_to_uvchr(s, &char_len);
+                   UV uv = utf8_to_uvchr_buf(s, e, &char_len);
                    Perl_warner(aTHX_ packWARN(WARN_NON_UNICODE),
                        "Code point 0x%04"UVXf" is not Unicode, may not be portable", uv);
                    ok = FALSE;
@@ -3849,7 +3947,7 @@ Perl_check_utf8_print(pTHX_ register const U8* s, const STRLEN len)
            }
            else if (UTF8_IS_SURROGATE(s)) {
                if (ckWARN_d(WARN_SURROGATE)) {
-                   UV uv = utf8_to_uvchr(s, &char_len);
+                   UV uv = utf8_to_uvchr_buf(s, e, &char_len);
                    Perl_warner(aTHX_ packWARN(WARN_SURROGATE),
                        "Unicode surrogate U+%04"UVXf" is illegal in UTF-8", uv);
                    ok = FALSE;
@@ -3859,7 +3957,7 @@ Perl_check_utf8_print(pTHX_ register const U8* s, const STRLEN len)
                ((UTF8_IS_NONCHAR_GIVEN_THAT_NON_SUPER_AND_GE_PROBLEMATIC(s))
                 && (ckWARN_d(WARN_NONCHAR)))
            {
-               UV uv = utf8_to_uvchr(s, &char_len);
+               UV uv = utf8_to_uvchr_buf(s, e, &char_len);
                Perl_warner(aTHX_ packWARN(WARN_NONCHAR),
                    "Unicode non-character U+%04"UVXf" is illegal for open interchange", uv);
                ok = FALSE;
@@ -3874,18 +3972,18 @@ Perl_check_utf8_print(pTHX_ register const U8* s, const STRLEN len)
 /*
 =for apidoc pv_uni_display
 
-Build to the scalar dsv a displayable version of the string spv,
-length len, the displayable version being at most pvlim bytes long
+Build to the scalar C<dsv> a displayable version of the string C<spv>,
+length C<len>, the displayable version being at most C<pvlim> bytes long
 (if longer, the rest is truncated and "..." will be appended).
 
-The flags argument can have UNI_DISPLAY_ISPRINT set to display
+The C<flags> argument can have UNI_DISPLAY_ISPRINT set to display
 isPRINT()able characters as themselves, UNI_DISPLAY_BACKSLASH
 to display the \\[nrfta\\] as the backslashed versions (like '\n')
 (UNI_DISPLAY_BACKSLASH is preferred over UNI_DISPLAY_ISPRINT for \\).
 UNI_DISPLAY_QQ (and its alias UNI_DISPLAY_REGEX) have both
 UNI_DISPLAY_BACKSLASH and UNI_DISPLAY_ISPRINT turned on.
 
-The pointer to the PV of the dsv is returned.
+The pointer to the PV of the C<dsv> is returned.
 
 =cut */
 char *
@@ -3909,7 +4007,7 @@ Perl_pv_uni_display(pTHX_ SV *dsv, const U8 *spv, STRLEN len, STRLEN pvlim, UV f
              truncated++;
              break;
         }
-        u = utf8_to_uvchr((U8*)s, 0);
+        u = utf8_to_uvchr_buf((U8*)s, (U8*)e, 0);
         if (u < 256) {
             const unsigned char c = (unsigned char)u & 0xFF;
             if (flags & UNI_DISPLAY_BACKSLASH) {
@@ -3953,13 +4051,13 @@ Perl_pv_uni_display(pTHX_ SV *dsv, const U8 *spv, STRLEN len, STRLEN pvlim, UV f
 /*
 =for apidoc sv_uni_display
 
-Build to the scalar dsv a displayable version of the scalar sv,
-the displayable version being at most pvlim bytes long
+Build to the scalar C<dsv> a displayable version of the scalar C<sv>,
+the displayable version being at most C<pvlim> bytes long
 (if longer, the rest is truncated and "..." will be appended).
 
-The flags argument is as in pv_uni_display().
+The C<flags> argument is as in L</pv_uni_display>().
 
-The pointer to the PV of the dsv is returned.
+The pointer to the PV of the C<dsv> is returned.
 
 =cut
 */
@@ -3975,40 +4073,42 @@ Perl_sv_uni_display(pTHX_ SV *dsv, SV *ssv, STRLEN pvlim, UV flags)
 /*
 =for apidoc foldEQ_utf8
 
-Returns true if the leading portions of the strings s1 and s2 (either or both
+Returns true if the leading portions of the strings C<s1> and C<s2> (either or both
 of which may be in UTF-8) are the same case-insensitively; false otherwise.
 How far into the strings to compare is determined by other input parameters.
 
-If u1 is true, the string s1 is assumed to be in UTF-8-encoded Unicode;
-otherwise it is assumed to be in native 8-bit encoding.  Correspondingly for u2
-with respect to s2.
+If C<u1> is true, the string C<s1> is assumed to be in UTF-8-encoded Unicode;
+otherwise it is assumed to be in native 8-bit encoding.  Correspondingly for C<u2>
+with respect to C<s2>.
 
-If the byte length l1 is non-zero, it says how far into s1 to check for fold
-equality.  In other words, s1+l1 will be used as a goal to reach.  The
+If the byte length C<l1> is non-zero, it says how far into C<s1> to check for fold
+equality.  In other words, C<s1>+C<l1> will be used as a goal to reach.  The
 scan will not be considered to be a match unless the goal is reached, and
-scanning won't continue past that goal.  Correspondingly for l2 with respect to
-s2.
-
-If pe1 is non-NULL and the pointer it points to is not NULL, that pointer is
-considered an end pointer beyond which scanning of s1 will not continue under
-any circumstances.  This means that if both l1 and pe1 are specified, and pe1
-is less than s1+l1, the match will never be successful because it can never
+scanning won't continue past that goal.  Correspondingly for C<l2> with respect to
+C<s2>.
+
+If C<pe1> is non-NULL and the pointer it points to is not NULL, that pointer is
+considered an end pointer beyond which scanning of C<s1> will not continue under
+any circumstances.  This means that if both C<l1> and C<pe1> are specified, and
+C<pe1>
+is less than C<s1>+C<l1>, the match will never be successful because it can
+never
 get as far as its goal (and in fact is asserted against).  Correspondingly for
-pe2 with respect to s2.
+C<pe2> with respect to C<s2>.
 
-At least one of s1 and s2 must have a goal (at least one of l1 and l2 must be
-non-zero), and if both do, both have to be
+At least one of C<s1> and C<s2> must have a goal (at least one of C<l1> and
+C<l2> must be non-zero), and if both do, both have to be
 reached for a successful match.   Also, if the fold of a character is multiple
 characters, all of them must be matched (see tr21 reference below for
 'folding').
 
-Upon a successful match, if pe1 is non-NULL,
-it will be set to point to the beginning of the I<next> character of s1 beyond
-what was matched.  Correspondingly for pe2 and s2.
+Upon a successful match, if C<pe1> is non-NULL,
+it will be set to point to the beginning of the I<next> character of C<s1>
+beyond what was matched.  Correspondingly for C<pe2> and C<s2>.
 
 For case-insensitiveness, the "casefolding" of Unicode is used
 instead of upper/lowercasing both the characters, see
-http://www.unicode.org/unicode/reports/tr21/ (Case Mappings).
+L<http://www.unicode.org/unicode/reports/tr21/> (Case Mappings).
 
 =cut */
 

--
Perl5 Master Repository
