In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/cd7e6c884f038d4463b1c4768533b484e5c5c919?hp=906024c7fead4086ed911b8052d784aa07c2f1e2>
- Log ----------------------------------------------------------------- commit cd7e6c884f038d4463b1c4768533b484e5c5c919 Author: Karl Williamson <[email protected]> Date: Wed Apr 18 22:14:15 2012 -0600 is_utf8_char_slow(): Avoid accepting overlongs There are possible overlong sequences that this function blindly accepts. Instead of developing the code to figure this out, turn this function into a wrapper for utf8n_to_uvuni() which already has this check. M utf8.c commit 524080c4d32ea2975130ce2ce31f3b3d508bf140 Author: Karl Williamson <[email protected]> Date: Wed Apr 18 18:32:57 2012 -0600 perlapi: Update for changes in utf8 decoding M utf8.c commit f555bc63534ca05176eb37540a96c0e644dbadff Author: Karl Williamson <[email protected]> Date: Mon Apr 23 13:28:32 2012 -0600 utf8.c: White-space only This outdents to account for the removal of a surrounding block. M utf8.c commit eb83ed87110e41de6a4cd4463f75df60798a9243 Author: Karl Williamson <[email protected]> Date: Wed Apr 18 17:36:01 2012 -0600 utf8.c: refactor utf8n_to_uvuni() The prior version had a number of issues, some of which have been taken care of in previous commits. The goal when presented with malformed input is to consume as few bytes as possible, so as to position the input for the next try to the first possible byte that could be the beginning of a character. We don't want to consume too few bytes, so that the next call has us thinking that what is the middle of a character is really the beginning; nor do we want to consume too many, so as to skip valid input characters. (This is forbidden by the Unicode standard because of security considerations.) The previous code could do both of these under various circumstances. In some cases it took as a given that the first byte in a character is correct, and skipped looking at the rest of the bytes in the sequence. This is wrong when just that first byte is garbled. We have to look at all bytes in the expected sequence to make sure it hasn't been prematurely terminated from what we were led to expect by that first byte. Likewise when we get an overflow: we have to keep looking at each byte in the sequence. It may be that the initial byte was garbled, so that it appeared that there was going to be overflow, but in reality, the input was supposed to be a shorter sequence that doesn't overflow. We want to have an error on that shorter sequence, and advance the pointer to just beyond it, which is the first position where a valid character could start. This fixes a long-standing TODO from an externally supplied utf8 decode test suite. And, the old algorithm for finding overflow failed to detect it on some inputs. This was spotted by Hugo van der Sanden, who suggested the new algorithm that this commit uses, and which should work in all instances. For example, on a 32-bit machine, any string beginning with "\xFE" and having the next byte be either "\x86" or \x87 overflows, but this was missed by the old algorithm. Another bug was that the code was careless about what happens when a malformation occurs that the input flags allow. For example, a sequence should not start with a continuation byte. If that malformation is allowed, the code pretended it is a start byte and extracts the "length" of the sequence from it. But pretending it is a start byte is not the same thing as it actually being a start byte, and so there is no extractable length in it, so the number that this code thought was "length" was bogus. Yet another bug fixed is that if only the warning subcategories of the utf8 category were turned on, and not the entire utf8 category itself, warnings were not raised that should have been. And yet another change is that given malformed input with warnings turned off, this function used to return whatever it had computed so far, which is incomplete or erroneous garbage. This commit changes to return the REPLACEMENT CHARACTER instead. Thanks to Hugo van der Sanden for reviewing and finding problems with an earlier version of these commits M Porting/perl5160delta.pod M ext/XS-APItest/APItest.xs M ext/XS-APItest/t/utf8.t M t/op/utf8decode.t M utf8.c M utf8.h commit 0b8d30e8ba4bed9219a0a08549fd9d07661587ee Author: Karl Williamson <[email protected]> Date: Wed Apr 18 16:48:29 2012 -0600 utf8n_to_uvuni: Avoid reading outside of buffer Prior to this patch, if the first byte of a UTF-8 sequence indicated that the sequence occupied n bytes, but the input parameters indicated that fewer were available, all n were attempted to be read M utf8.c commit 746afd533cc96b75c8a3c821291822f0c0ce7e2a Author: Karl Williamson <[email protected]> Date: Wed Apr 18 16:35:39 2012 -0600 utf8.c: Clarify and correct pod Some of these were spotted by Hugo van der Sanden M utf8.c commit 99ee1dcd0469086e91a96e31a9b9ea27bb7f0c7e Author: Karl Williamson <[email protected]> Date: Wed Apr 18 16:20:22 2012 -0600 utf8.c: Use macros instead of if..else.. sequence There are two existing macros that do the job that this longish sequence does. One, UTF8SKIP(), does an array lookup and is very likely to be in the machine's cache as it is used ubiquitously when processing UTF-8. The other is a simple test and shift. These simplify the code and should speed things up as well. M utf8.c commit 0447e8df3db3f566f76a613f62c5f4cdd7262997 Author: Karl Williamson <[email protected]> Date: Wed Apr 18 15:25:28 2012 -0600 utf8.h: Use correct definition of start byte The previous definition allowed for (illegal) overlongs. The uses of this macro in the core assume that it is accurate. The inacurracy can cause such code to fail. M utf8.h M utfebcdic.h commit 0ae1fa71a437dfa435b139674610ec992366d661 Author: Christian Hansen <[email protected]> Date: Wed Apr 18 14:32:16 2012 -0600 utf8.h: Use correct UTF-8 downgradeable definition Previously, the macro changed by this commit would accept overlong sequences. This patch was changed by the committer to to include EBCDIC changes; and in the non-EBCDIC case, to save a test, by using a mask instead, in keeping with the prior version of the code M AUTHORS M t/op/print.t M utf8.h M utfebcdic.h commit dd53ca2f01b45dd5a54bd2d00709dbfbe00ccdf3 Author: Brian Fraser <[email protected]> Date: Fri Apr 20 22:09:56 2012 -0300 Make unicode label tests use unicode_eval. A recent change exposed a faulty test, in t/uni/labels.t; Previously, a downgraded label passed to eval under 'use utf8;' would've been erroneously considered UTF-8 and the tests would pass. Now it's correctly reported as illegal UTF-8 unless unicode_eval is in effect. M t/uni/labels.t ----------------------------------------------------------------------- Summary of changes: AUTHORS | 1 + Porting/perl5160delta.pod | 10 + ext/XS-APItest/APItest.xs | 35 +++ ext/XS-APItest/t/utf8.t | 223 ++++++++++++++++++++ t/op/print.t | 18 ++- t/op/utf8decode.t | 2 +- t/uni/labels.t | 1 + utf8.c | 514 +++++++++++++++++++++++++++------------------ utf8.h | 14 +- utfebcdic.h | 5 +- 10 files changed, 607 insertions(+), 216 deletions(-) diff --git a/AUTHORS b/AUTHORS index 0369b91..88342aa 100644 --- a/AUTHORS +++ b/AUTHORS @@ -207,6 +207,7 @@ Chris Tubutis <[email protected]> Chris Wick <[email protected]> Chris Williams <[email protected]> Christian Burger <[email protected]> +Christian Hansen <[email protected]> Christian Kirsch <[email protected]> Christian Winter <[email protected]> Christoph Lamprecht <[email protected]> diff --git a/Porting/perl5160delta.pod b/Porting/perl5160delta.pod index 55e218e..5f3be51 100644 --- a/Porting/perl5160delta.pod +++ b/Porting/perl5160delta.pod @@ -1806,6 +1806,16 @@ new C<GV_ADDMG> flag (not part of the API). =item * +The returned code point from the function C<utf8n_to_uvuni()> +when the input is malformed UTF-8, malformations are allowed, and +C<utf8> warnings are off is now the Unicode REPLACEMENT CHARACTER +whenever the malformation is such that no well-defined code point can be +computed. Previously the returned value was essentially garbage. The +only malformations that have well-defined values are a zero-length +string (0 is the return), and overlong UTF-8 sequences. + +=item * + Padlists are now marked C<AvREAL>; i.e., reference-counted. They have always been reference-counted, but were not marked real, because F<pad.c> did its own clean-up, instead of using the usual clean-up code in F<sv.c>. diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs index 34fbfde..e2d34d9 100644 --- a/ext/XS-APItest/APItest.xs +++ b/ext/XS-APItest/APItest.xs @@ -1127,6 +1127,41 @@ bytes_cmp_utf8(bytes, utf8) OUTPUT: RETVAL +AV * +test_utf8n_to_uvuni(s, len, flags) + + SV *s + SV *len + SV *flags + PREINIT: + STRLEN retlen; + UV ret; + STRLEN slen; + + CODE: + /* Call utf8n_to_uvuni() with the inputs. It always asks for the + * actual length to be returned + * + * Length to assume <s> is; not checked, so could have buffer overflow + */ + RETVAL = newAV(); + sv_2mortal((SV*)RETVAL); + + ret + = utf8n_to_uvuni((U8*) SvPV(s, slen), SvUV(len), &retlen, SvUV(flags)); + + /* Returns the return value in [0]; <retlen> in [1] */ + av_push(RETVAL, newSVuv(ret)); + if (retlen == (STRLEN) -1) { + av_push(RETVAL, newSViv(-1)); + } + else { + av_push(RETVAL, newSVuv(retlen)); + } + + OUTPUT: + RETVAL + MODULE = XS::APItest:Overload PACKAGE = XS::APItest::Overload void diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t index 9ad99f2..b59fb69 100644 --- a/ext/XS-APItest/t/utf8.t +++ b/ext/XS-APItest/t/utf8.t @@ -24,4 +24,227 @@ foreach ([0, '', '', 'empty'], is(bytes_cmp_utf8($right, $left), -$expect, "$desc reversed"); } +# Test uft8n_to_uvuni(). These provide essentially complete code coverage. + +# Copied from utf8.h +my $UTF8_ALLOW_EMPTY = 0x0001; +my $UTF8_ALLOW_CONTINUATION = 0x0002; +my $UTF8_ALLOW_NON_CONTINUATION = 0x0004; +my $UTF8_ALLOW_SHORT = 0x0008; +my $UTF8_ALLOW_LONG = 0x0010; +my $UTF8_DISALLOW_SURROGATE = 0x0020; +my $UTF8_WARN_SURROGATE = 0x0040; +my $UTF8_DISALLOW_NONCHAR = 0x0080; +my $UTF8_WARN_NONCHAR = 0x0100; +my $UTF8_DISALLOW_SUPER = 0x0200; +my $UTF8_WARN_SUPER = 0x0400; +my $UTF8_DISALLOW_FE_FF = 0x0800; +my $UTF8_WARN_FE_FF = 0x1000; +my $UTF8_CHECK_ONLY = 0x2000; + +my $REPLACEMENT = 0xFFFD; + +my @warnings; + +use warnings 'utf8'; +local $SIG{__WARN__} = sub { push @warnings, @_ }; + +# First test the malformations. All these raise category utf8 warnings. +foreach my $test ( + [ "zero length string malformation", "", 0, + $UTF8_ALLOW_EMPTY, 0, 0, + qr/empty string/ + ], + [ "orphan continuation byte malformation", "\x80a", 2, + $UTF8_ALLOW_CONTINUATION, $REPLACEMENT, 1, + qr/unexpected continuation byte/ + ], + [ "premature next character malformation (immediate)", "\xc2a", 2, + $UTF8_ALLOW_NON_CONTINUATION, $REPLACEMENT, 1, + qr/unexpected non-continuation byte.*immediately after start byte/ + ], + [ "premature next character malformation (non-immediate)", "\xf0\x80a", 3, + $UTF8_ALLOW_NON_CONTINUATION, $REPLACEMENT, 2, + qr/unexpected non-continuation byte .* 2 bytes after start byte/ + ], + [ "too short malformation", "\xf0\x80a", 2, + # Having the 'a' after this, but saying there are only 2 bytes also + # tests that we pay attention to the passed in length + $UTF8_ALLOW_SHORT, $REPLACEMENT, 2, + qr/2 bytes, need 4/ + ], + [ "overlong malformation", "\xc1\xaf", 2, + $UTF8_ALLOW_LONG, ord('o'), 2, + qr/2 bytes, need 1/ + ], + [ "overflow malformation", "\xff\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf", 13, + 0, # There is no way to allow this malformation + $REPLACEMENT, 13, + qr/overflow/ + ], +) { + my ($testname, $bytes, $length, $allow_flags, $allowed_uv, $expected_len, $message ) = @$test; + + next if ! ok(length($bytes) >= $length, "$testname: Make sure won't read beyond buffer: " . length($bytes) . " >= $length"); + + # Test what happens when this malformation is not allowed + undef @warnings; + my $ret_ref = test_utf8n_to_uvuni($bytes, $length, 0); + is($ret_ref->[0], 0, "$testname: disallowed: Returns 0"); + is($ret_ref->[1], $expected_len, "$testname: disallowed: Returns expected length"); + if (is(scalar @warnings, 1, "$testname: disallowed: Got a single warning ")) { + like($warnings[0], $message, "$testname: disallowed: Got expected warning"); + } + else { + if (scalar @warnings) { + note "The warnings were: " . join(", ", @warnings); + } + } + + { # Next test when disallowed, and warnings are off. + undef @warnings; + no warnings 'utf8'; + my $ret_ref = test_utf8n_to_uvuni($bytes, $length, 0); + is($ret_ref->[0], 0, "$testname: disallowed: no warnings 'utf8': Returns 0"); + is($ret_ref->[1], $expected_len, "$testname: disallowed: no warnings 'utf8': Returns expected length"); + if (!is(scalar @warnings, 0, "$testname: disallowed: no warnings 'utf8': no warnings generated")) { + note "The warnings were: " . join(", ", @warnings); + } + } + + # Test with CHECK_ONLY + undef @warnings; + $ret_ref = test_utf8n_to_uvuni($bytes, $length, $UTF8_CHECK_ONLY); + is($ret_ref->[0], 0, "$testname: CHECK_ONLY: Returns 0"); + is($ret_ref->[1], -1, "$testname: CHECK_ONLY: returns expected length"); + if (! is(scalar @warnings, 0, "$testname: CHECK_ONLY: no warnings generated")) { + note "The warnings were: " . join(", ", @warnings); + } + + next if $allow_flags == 0; # Skip if can't allow this malformation + + # Test when the malformation is allowed + undef @warnings; + $ret_ref = test_utf8n_to_uvuni($bytes, $length, $allow_flags); + is($ret_ref->[0], $allowed_uv, "$testname: allowed: Returns expected uv"); + is($ret_ref->[1], $expected_len, "$testname: allowed: Returns expected length"); + if (!is(scalar @warnings, 0, "$testname: allowed: no warnings generated")) + { + note "The warnings were: " . join(", ", @warnings); + } +} + +my $FF_ret; + +use Unicode::UCD; +my $has_quad = ($Unicode::UCD::MAX_CP > 0xFFFF_FFFF); +if ($has_quad) { + no warnings qw{portable overflow}; + $FF_ret = 0x1000000000; +} +else { # The above overflows unless a quad platform + $FF_ret = 0; +} + +# Now test the cases where a legal code point is generated, but may or may not +# be allowed/warned on. +foreach my $test ( + [ "surrogate", "\xed\xa4\x8d", + $UTF8_WARN_SURROGATE, $UTF8_DISALLOW_SURROGATE, 'surrogate', 0xD90D, 3, + qr/surrogate/ + ], + [ "non_unicode", "\xf4\x90\x80\x80", + $UTF8_WARN_SUPER, $UTF8_DISALLOW_SUPER, 'non_unicode', 0x110000, 4, + qr/not Unicode/ + ], + [ "non-character code point", "\xEF\xB7\x90", + $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR, 'nonchar', 0xFDD0, 3, + qr/Unicode non-character.*is illegal for open interchange/ + ], + [ "begins with FE", "\xfe\x82\x80\x80\x80\x80\x80", + + # This code point is chosen so that it is representable in a UV on + # 32-bit machines, otherwise we would have to handle it like the FF + # ones + $UTF8_WARN_FE_FF, $UTF8_DISALLOW_FE_FF, 'utf8', 0x80000000, 7, + qr/Code point beginning with byte .* is not Unicode, and not portable/ + ], + [ "begins with FF", "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80", + $UTF8_WARN_FE_FF, $UTF8_DISALLOW_FE_FF, 'utf8', $FF_ret, 13, + qr/Code point beginning with byte .* is not Unicode, and not portable/ + ], +) { + my ($testname, $bytes, $warn_flags, $disallow_flags, $category, $allowed_uv, $expected_len, $message ) = @$test; + + my $length = length $bytes; + + # This is more complicated than the malformations tested earlier, as there + # are several orthogonal variables involved. We test all the subclasses + # of utf8 warnings to verify they work with and without the utf8 class, + # and don't have effects on other sublass warnings + foreach my $warning (0, 'utf8', 'surrogate', 'nonchar', 'non_unicode') { + foreach my $warn_flag (0, $warn_flags) { + foreach my $disallow_flag (0, $disallow_flags) { + + # On 32-bit machines, anything beginning with \xff is not + # representable, and would overflow even if we were to allow + # them in this test. + next if ! $has_quad + && ! $disallow_flag + && substr($bytes, 0, 1) eq "\xff"; + + no warnings 'utf8'; + my $eval_warn = $warning eq 0 ? "no warnings" : "use warnings '$warning'"; + my $this_name = "$testname: " . (($disallow_flag) ? 'disallowed' : 'allowed'); + $this_name .= ", $eval_warn"; + $this_name .= ", " . (($warn_flag) ? 'with warning flag' : 'no warning flag'); + + undef @warnings; + my $ret_ref; + #note __LINE__ . ": $eval_warn; \$ret_ref = test_utf8n_to_uvuni('$bytes', $length, $warn_flag|$disallow_flag)"; + my $eval_text = "$eval_warn; \$ret_ref = test_utf8n_to_uvuni('$bytes', $length, $warn_flag|$disallow_flag)"; + eval "$eval_text"; + if (! ok ("$@ eq ''", "$this_name: eval succeeded")) { + note "\$!='$!'; eval'd=\"$eval_text\""; + next; + } + if ($disallow_flag) { + is($ret_ref->[0], 0, "$this_name: Returns 0"); + } + else { + is($ret_ref->[0], $allowed_uv, "$this_name: Returns expected uv"); + } + is($ret_ref->[1], $expected_len, "$this_name: Returns expected length"); + + if ($warn_flag && ($warning eq 'utf8' || $warning eq $category)) { + if (is(scalar @warnings, 1, "$this_name: Got a single warning ")) { + like($warnings[0], $message, "$this_name: Got expected warning"); + } + else { + if (scalar @warnings) { + note "The warnings were: " . join(", ", @warnings); + } + } + } + else { + if (!is(scalar @warnings, 0, "$this_name: No warnings generated")) + { + note "The warnings were: " . join(", ", @warnings); + } + } + + if ($disallow_flag) { + undef @warnings; + $ret_ref = test_utf8n_to_uvuni($bytes, $length, $disallow_flag|$UTF8_CHECK_ONLY); + is($ret_ref->[0], 0, "$this_name, CHECK_ONLY: Returns 0"); + is($ret_ref->[1], -1, "$this_name: CHECK_ONLY: returns expected length"); + if (! is(scalar @warnings, 0, "$this_name, CHECK_ONLY: no warnings generated")) { + note "The warnings were: " . join(", ", @warnings); + } + } + } + } + } +} + done_testing; diff --git a/t/op/print.t b/t/op/print.t index 3752251..5e508f6 100644 --- a/t/op/print.t +++ b/t/op/print.t @@ -4,9 +4,25 @@ BEGIN { require "test.pl"; } -plan(2); +plan(3); fresh_perl_is('$_ = qq{OK\n}; print;', "OK\n", 'print without arguments outputs $_'); fresh_perl_is('$_ = qq{OK\n}; print STDOUT;', "OK\n", 'print with only a filehandle outputs $_'); +SKIP: { + skip_if_miniperl('no dynamic loading of PerlIO::scalar in miniperl'); +fresh_perl_is(<<'EOF', "\xC1\xAF\xC1\xAF\xC1\xB0\xC1\xB3", "", "print doesn't launder utf8 overlongs"); +use strict; +use warnings; + +no warnings 'utf8'; + +# These form overlong "oops" +open my $fh, "<:utf8", \"\xC1\xAF\xC1\xAF\xC1\xB0\xC1\xB3" + or die "Could not open\n"; +read($fh, my $s, 10) or die "Could not read\n"; +print $s; +EOF + +} diff --git a/t/op/utf8decode.t b/t/op/utf8decode.t index ba785fa..92de821 100644 --- a/t/op/utf8decode.t +++ b/t/op/utf8decode.t @@ -135,7 +135,7 @@ __DATA__ 3.3.9 n - 4 fb:bf:bf:bf - 4 bytes, need 5 3.3.10 n - 5 fd:bf:bf:bf:bf - 5 bytes, need 6 3.4 Concatenation of incomplete sequences -3.4.1 N-10 - 30 c0:e0:80:f0:80:80:f8:80:80:80:fc:80:80:80:80:df:ef:bf:f7:bf:bf:fb:bf:bf:bf:fd:bf:bf:bf:bf - unexpected non-continuation byte 0xe0, immediately after start byte 0xc0 +3.4.1 N10 - 30 c0:e0:80:f0:80:80:f8:80:80:80:fc:80:80:80:80:df:ef:bf:f7:bf:bf:fb:bf:bf:bf:fd:bf:bf:bf:bf - unexpected non-continuation byte 0xe0, immediately after start byte 0xc0 3.5 Impossible bytes 3.5.1 n - 1 fe - byte 0xfe 3.5.2 n - 1 ff - byte 0xff diff --git a/t/uni/labels.t b/t/uni/labels.t index 1e4d684..3d7d476 100644 --- a/t/uni/labels.t +++ b/t/uni/labels.t @@ -62,6 +62,7 @@ LÃBEL: { utf8::downgrade($prog); } if ($d--) { + use feature 'unicode_eval'; no warnings 'exiting'; eval $prog; } diff --git a/utf8.c b/utf8.c index 8585eb1..c01ea4b 100644 --- a/utf8.c +++ b/utf8.c @@ -277,43 +277,15 @@ five bytes or more. STATIC STRLEN S_is_utf8_char_slow(const U8 *s, const STRLEN len) { - U8 u = *s; - STRLEN slen; - UV uv, ouv; - - PERL_ARGS_ASSERT_IS_UTF8_CHAR_SLOW; - - if (UTF8_IS_INVARIANT(u)) - return len == 1; + dTHX; /* The function called below requires thread context */ - if (!UTF8_IS_START(u)) - return 0; - - if (len < 2 || !UTF8_IS_CONTINUATION(s[1])) - return 0; + STRLEN actual_len; - slen = len - 1; - s++; -#ifdef EBCDIC - u = NATIVE_TO_UTF(u); -#endif - u &= UTF_START_MASK(len); - uv = u; - ouv = uv; - while (slen--) { - if (!UTF8_IS_CONTINUATION(*s)) - return 0; - uv = UTF8_ACCUMULATE(uv, *s); - if (uv < ouv) - return 0; - ouv = uv; - s++; - } + PERL_ARGS_ASSERT_IS_UTF8_CHAR_SLOW; - if ((STRLEN)UNISKIP(uv) < len) - return 0; + utf8n_to_uvuni(s, len, &actual_len, UTF8_CHECK_ONLY); - return len; + return (actual_len == (STRLEN) -1) ? 0 : actual_len; } /* @@ -505,16 +477,16 @@ Perl_is_utf8_string_loclen(const U8 *s, STRLEN len, const U8 **ep, STRLEN *el) =for apidoc utf8n_to_uvuni Bottom level UTF-8 decode routine. -Returns the code point value of the first character in the string C<s> -which is assumed to be in UTF-8 (or UTF-EBCDIC) encoding and no longer than -C<curlen> bytes; C<retlen> will be set to the length, in bytes, of that -character. +Returns the code point value of the first character in the string C<s>, +which is assumed to be in UTF-8 (or UTF-EBCDIC) encoding, and no longer than +C<curlen> bytes; C<*retlen> (if C<retlen> isn't NULL) will be set to +the length, in bytes, of that character. The value of C<flags> determines the behavior when C<s> does not point to a well-formed UTF-8 character. If C<flags> is 0, when a malformation is found, -C<retlen> is set to the expected length of the UTF-8 character in bytes, zero -is returned, and if UTF-8 warnings haven't been lexically disabled, a warning -is raised. +zero is returned and C<*retlen> is set so that (S<C<s> + C<*retlen>>) is the +next possible position in C<s> that could begin a non-malformed character. +Also, if UTF-8 warnings haven't been lexically disabled, a warning is raised. Various ALLOW flags can be set in C<flags> to allow (and not warn on) individual types of malformations, such as the sequence being overlong (that @@ -522,8 +494,10 @@ is, when there is a shorter sequence that can express the same code point; overlong sequences are expressly forbidden in the UTF-8 standard due to potential security issues). Another malformation example is the first byte of a character not being a legal first byte. See F<utf8.h> for the list of such -flags. Of course, the value returned by this function under such conditions is -not reliable. +flags. For allowed 0 length strings, this function returns 0; for allowed +overlong sequences, the computed code point is returned; for all other allowed +malformations, the Unicode REPLACEMENT CHARACTER is returned, as these have no +determinable reasonable value. The UTF8_CHECK_ONLY flag overrides the behavior when a non-allowed (by other flags) malformation is found. If this flag is set, the routine assumes that @@ -531,7 +505,7 @@ the caller will raise a warning, and this function will silently just set C<retlen> to C<-1> and return zero. Certain code points are considered problematic. These are Unicode surrogates, -Unicode non-characters, and code points above the Unicode maximum of 0x10FFF. +Unicode non-characters, and code points above the Unicode maximum of 0x10FFFF. By default these are considered regular code points, but certain situations warrant special handling for them. If C<flags> contains UTF8_DISALLOW_ILLEGAL_INTERCHANGE, all three classes are treated as @@ -549,15 +523,18 @@ UTF8_CHECK_ONLY is also specified.) Very large code points (above 0x7FFF_FFFF) are considered more problematic than the others that are above the Unicode legal maximum. There are several -reasons: they do not fit into a 32-bit word, are not representable on EBCDIC -platforms, and the original UTF-8 specification never went above -this number (the current 0x10FFF limit was imposed later). The UTF-8 encoding -on ASCII platforms for these large code points begins with a byte containing -0xFE or 0xFF. The UTF8_DISALLOW_FE_FF flag will cause them to be treated as -malformations, while allowing smaller above-Unicode code points. (Of course -UTF8_DISALLOW_SUPER will treat all above-Unicode code points, including these, -as malformations.) Similarly, UTF8_WARN_FE_FF acts just like the other WARN -flags, but applies just to these code points. +reasons: they requre at least 32 bits to represent them on ASCII platforms, are +not representable at all on EBCDIC platforms, and the original UTF-8 +specification never went above this number (the current 0x10FFFF limit was +imposed later). (The smaller ones, those that fit into 32 bits, are +representable by a UV on ASCII platforms, but not by an IV, which means that +the number of operations that can be performed on them is quite restricted.) +The UTF-8 encoding on ASCII platforms for these large code points begins with a +byte containing 0xFE or 0xFF. The UTF8_DISALLOW_FE_FF flag will cause them to +be treated as malformations, while allowing smaller above-Unicode code points. +(Of course UTF8_DISALLOW_SUPER will treat all above-Unicode code points, +including these, as malformations.) Similarly, UTF8_WARN_FE_FF acts just like +the other WARN flags, but applies just to these code points. All other code points corresponding to Unicode characters, including private use and those yet to be assigned, are never considered malformed and never @@ -573,225 +550,326 @@ Perl_utf8n_to_uvuni(pTHX_ const U8 *s, STRLEN curlen, STRLEN *retlen, U32 flags) { dVAR; const U8 * const s0 = s; - UV uv = *s, ouv = 0; - STRLEN len = 1; - bool dowarn = ckWARN_d(WARN_UTF8); - const UV startbyte = *s; - STRLEN expectlen = 0; - U32 warning = 0; + U8 overflow_byte = '\0'; /* Save byte in case of overflow */ + U8 * send; + UV uv = *s; + STRLEN expectlen; SV* sv = NULL; + UV outlier_ret = 0; /* return value when input is in error or problematic + */ + UV pack_warn = 0; /* Save result of packWARN() for later */ + bool unexpected_non_continuation = FALSE; + bool overflowed = FALSE; - PERL_ARGS_ASSERT_UTF8N_TO_UVUNI; + const char* const malformed_text = "Malformed UTF-8 character"; -/* This list is a superset of the UTF8_ALLOW_XXX. */ + PERL_ARGS_ASSERT_UTF8N_TO_UVUNI; -#define UTF8_WARN_EMPTY 1 -#define UTF8_WARN_CONTINUATION 2 -#define UTF8_WARN_NON_CONTINUATION 3 -#define UTF8_WARN_SHORT 4 -#define UTF8_WARN_OVERFLOW 5 -#define UTF8_WARN_LONG 6 + /* The order of malformation tests here is important. We should consume as + * few bytes as possible in order to not skip any valid character. This is + * required by the Unicode Standard (section 3.9 of Unicode 6.0); see also + * http://unicode.org/reports/tr36 for more discussion as to why. For + * example, once we've done a UTF8SKIP, we can tell the expected number of + * bytes, and could fail right off the bat if the input parameters indicate + * that there are too few available. But it could be that just that first + * byte is garbled, and the intended character occupies fewer bytes. If we + * blindly assumed that the first byte is correct, and skipped based on + * that number, we could skip over a valid input character. So instead, we + * always examine the sequence byte-by-byte. + * + * We also should not consume too few bytes, otherwise someone could inject + * things. For example, an input could be deliberately designed to + * overflow, and if this code bailed out immediately upon discovering that, + * returning to the caller *retlen pointing to the very next byte (one + * which is actually part of of the overflowing sequence), that could look + * legitimate to the caller, which could discard the initial partial + * sequence and process the rest, inappropriately */ + + /* Zero length strings, if allowed, of necessity are zero */ + if (curlen == 0) { + if (retlen) { + *retlen = 0; + } - if (curlen == 0 && - !(flags & UTF8_ALLOW_EMPTY)) { - warning = UTF8_WARN_EMPTY; + if (flags & UTF8_ALLOW_EMPTY) { + return 0; + } + if (! (flags & UTF8_CHECK_ONLY)) { + sv = sv_2mortal(Perl_newSVpvf(aTHX_ "%s (empty string)", malformed_text)); + } goto malformed; } + expectlen = UTF8SKIP(s); + + /* A well-formed UTF-8 character, as the vast majority of calls to this + * function will be for, has this expected length. For efficiency, set + * things up here to return it. It will be overriden only in those rare + * cases where a malformation is found */ + if (retlen) { + *retlen = expectlen; + } + + /* An invariant is trivially well-formed */ if (UTF8_IS_INVARIANT(uv)) { - if (retlen) - *retlen = 1; return (UV) (NATIVE_TO_UTF(*s)); } - if (UTF8_IS_CONTINUATION(uv) && - !(flags & UTF8_ALLOW_CONTINUATION)) { - warning = UTF8_WARN_CONTINUATION; - goto malformed; - } + /* A continuation character can't start a valid sequence */ + if (UTF8_IS_CONTINUATION(uv)) { + if (flags & UTF8_ALLOW_CONTINUATION) { + if (retlen) { + *retlen = 1; + } + return UNICODE_REPLACEMENT; + } - if (UTF8_IS_START(uv) && curlen > 1 && !UTF8_IS_CONTINUATION(s[1]) && - !(flags & UTF8_ALLOW_NON_CONTINUATION)) { - warning = UTF8_WARN_NON_CONTINUATION; + if (! (flags & UTF8_CHECK_ONLY)) { + sv = sv_2mortal(Perl_newSVpvf(aTHX_ "%s (unexpected continuation byte 0x%02x, with no preceding start byte)", malformed_text, *s0)); + } + curlen = 1; goto malformed; } #ifdef EBCDIC uv = NATIVE_TO_UTF(uv); -#else - if (uv == 0xfe || uv == 0xff) { - if (flags & (UTF8_WARN_SUPER|UTF8_WARN_FE_FF)) { - sv = sv_2mortal(Perl_newSVpvf(aTHX_ "Code point beginning with byte 0x%02"UVXf" is not Unicode, and not portable", uv)); - flags &= ~UTF8_WARN_SUPER; /* Only warn once on this problem */ +#endif + + /* Here is not a continuation byte, nor an invariant. The only thing left + * is a start byte (possibly for an overlong) */ + + /* Remove the leading bits that indicate the number of bytes in the + * character's whole UTF-8 sequence, leaving just the bits that are part of + * the value */ + uv &= UTF_START_MASK(expectlen); + + /* Now, loop through the remaining bytes in the character's sequence, + * accumulating each into the working value as we go. Be sure to not look + * past the end of the input string */ + send = (U8*) s0 + ((expectlen <= curlen) ? expectlen : curlen); + + for (s = s0 + 1; s < send; s++) { + if (UTF8_IS_CONTINUATION(*s)) { +#ifndef EBCDIC /* Can't overflow in EBCDIC */ + if (uv & UTF_ACCUMULATION_OVERFLOW_MASK) { + + /* The original implementors viewed this malformation as more + * serious than the others (though I, khw, don't understand + * why, since other malformations also give very very wrong + * results), so there is no way to turn off checking for it. + * Set a flag, but keep going in the loop, so that we absorb + * the rest of the bytes that comprise the character. */ + overflowed = TRUE; + overflow_byte = *s; /* Save for warning message's use */ + } +#endif + uv = UTF8_ACCUMULATE(uv, *s); + } + else { + /* Here, found a non-continuation before processing all expected + * bytes. This byte begins a new character, so quit, even if + * allowing this malformation. */ + unexpected_non_continuation = TRUE; + break; } - if (flags & (UTF8_DISALLOW_SUPER|UTF8_DISALLOW_FE_FF)) { + } /* End of loop through the character's bytes */ + + /* Save how many bytes were actually in the character */ + curlen = s - s0; + + /* The loop above finds two types of malformations: non-continuation and/or + * overflow. The non-continuation malformation is really a too-short + * malformation, as it means that the current character ended before it was + * expected to (being terminated prematurely by the beginning of the next + * character, whereas in the too-short malformation there just are too few + * bytes available to hold the character. In both cases, the check below + * that we have found the expected number of bytes would fail if executed.) + * Thus the non-continuation malformation is really unnecessary, being a + * subset of the too-short malformation. But there may be existing + * applications that are expecting the non-continuation type, so we retain + * it, and return it in preference to the too-short malformation. (If this + * code were being written from scratch, the two types might be collapsed + * into one.) I, khw, am also giving priority to returning the + * non-continuation and too-short malformations over overflow when multiple + * ones are present. I don't know of any real reason to prefer one over + * the other, except that it seems to me that multiple-byte errors trumps + * errors from a single byte */ + if (unexpected_non_continuation) { + if (!(flags & UTF8_ALLOW_NON_CONTINUATION)) { + if (! (flags & UTF8_CHECK_ONLY)) { + if (curlen == 1) { + sv = sv_2mortal(Perl_newSVpvf(aTHX_ "%s (unexpected non-continuation byte 0x%02x, immediately after start byte 0x%02x)", malformed_text, *s, *s0)); + } + else { + sv = sv_2mortal(Perl_newSVpvf(aTHX_ "%s (unexpected non-continuation byte 0x%02x, %d bytes after start byte 0x%02x, expected %d bytes)", malformed_text, *s, (int) curlen, *s0, (int)expectlen)); + } + } goto malformed; } + uv = UNICODE_REPLACEMENT; + if (retlen) { + *retlen = curlen; + } + } + else if (curlen < expectlen) { + if (! (flags & UTF8_ALLOW_SHORT)) { + if (! (flags & UTF8_CHECK_ONLY)) { + sv = sv_2mortal(Perl_newSVpvf(aTHX_ "%s (%d byte%s, need %d, after start byte 0x%02x)", malformed_text, (int)curlen, curlen == 1 ? "" : "s", (int)expectlen, *s0)); + } + goto malformed; + } + uv = UNICODE_REPLACEMENT; + if (retlen) { + *retlen = curlen; + } } -#endif - - if (!(uv & 0x20)) { len = 2; uv &= 0x1f; } - else if (!(uv & 0x10)) { len = 3; uv &= 0x0f; } - else if (!(uv & 0x08)) { len = 4; uv &= 0x07; } - else if (!(uv & 0x04)) { len = 5; uv &= 0x03; } -#ifdef EBCDIC - else if (!(uv & 0x02)) { len = 6; uv &= 0x01; } - else { len = 7; uv &= 0x01; } -#else - else if (!(uv & 0x02)) { len = 6; uv &= 0x01; } - else if (!(uv & 0x01)) { len = 7; uv = 0; } - else { len = 13; uv = 0; } /* whoa! */ -#endif - - if (retlen) - *retlen = len; - expectlen = len; +#ifndef EBCDIC /* EBCDIC allows FE, FF, can't overflow */ + else if ((*s0 & 0xFE) == 0xFE /* matches FE or FF */ + && (flags & (UTF8_WARN_FE_FF|UTF8_DISALLOW_FE_FF))) + { + /* By adding UTF8_CHECK_ONLY to the test, we avoid unnecessary + * generation of the sv, since no warnings are raised under CHECK */ + if ((flags & (UTF8_WARN_FE_FF|UTF8_CHECK_ONLY)) == UTF8_WARN_FE_FF + && ckWARN_d(WARN_UTF8)) + { + sv = sv_2mortal(Perl_newSVpvf(aTHX_ "%s Code point beginning with byte 0x%02X is not Unicode, and not portable", malformed_text, *s0)); + pack_warn = packWARN(WARN_UTF8); + } + if (flags & UTF8_DISALLOW_FE_FF) { + goto malformed; + } + } + else if (overflowed) { - if ((curlen < expectlen) && - !(flags & UTF8_ALLOW_SHORT)) { - warning = UTF8_WARN_SHORT; + /* If the first byte is FF, it will overflow a 32-bit word. If the + * first byte is FE, it will overflow a signed 32-bit word. The + * above preserves backward compatibility, since its message was used + * in earlier versions of this code in preference to overflow */ + sv = sv_2mortal(Perl_newSVpvf(aTHX_ "%s (overflow at byte 0x%02x, after start byte 0x%02x)", malformed_text, overflow_byte, *s0)); goto malformed; } +#endif - len--; - s++; - ouv = uv; /* ouv is the value from the previous iteration */ - - while (len--) { - if (!UTF8_IS_CONTINUATION(*s) && - !(flags & UTF8_ALLOW_NON_CONTINUATION)) { - s--; - warning = UTF8_WARN_NON_CONTINUATION; - goto malformed; + else if (expectlen > (STRLEN)UNISKIP(uv) && ! (flags & UTF8_ALLOW_LONG)) { + /* The overlong malformation has lower precedence than the others. + * Note that if this malformation is allowed, we return the actual + * value, instead of the replacement character. This is because this + * value is actually well-defined. */ + if (! (flags & UTF8_CHECK_ONLY)) { + sv = sv_2mortal(Perl_newSVpvf(aTHX_ "%s (%d byte%s, need %d, after start byte 0x%02x)", malformed_text, (int)expectlen, expectlen == 1 ? "": "s", UNISKIP(uv), *s0)); } - else - uv = UTF8_ACCUMULATE(uv, *s); - if (!(uv > ouv)) { /* If the value didn't grow from the previous - iteration, something is horribly wrong */ - /* These cannot be allowed. */ - if (uv == ouv) { - if (expectlen != 13 && !(flags & UTF8_ALLOW_LONG)) { - warning = UTF8_WARN_LONG; - goto malformed; - } - } - else { /* uv < ouv */ - /* This cannot be allowed. */ - warning = UTF8_WARN_OVERFLOW; - goto malformed; - } - } - s++; - ouv = uv; + goto malformed; } - if ((expectlen > (STRLEN)UNISKIP(uv)) && !(flags & UTF8_ALLOW_LONG)) { - warning = UTF8_WARN_LONG; - goto malformed; - } else if (flags & (UTF8_DISALLOW_ILLEGAL_INTERCHANGE|UTF8_WARN_ILLEGAL_INTERCHANGE)) { + /* Here, the input is considered to be well-formed , but could be a + * problematic code point that is not allowed by the input parameters. */ + if (uv >= UNICODE_SURROGATE_FIRST /* isn't problematic if < this */ + && (flags & (UTF8_DISALLOW_ILLEGAL_INTERCHANGE + |UTF8_WARN_ILLEGAL_INTERCHANGE))) + { if (UNICODE_IS_SURROGATE(uv)) { - if ((flags & (UTF8_WARN_SURROGATE|UTF8_CHECK_ONLY)) == UTF8_WARN_SURROGATE) { + if ((flags & (UTF8_WARN_SURROGATE|UTF8_CHECK_ONLY)) == UTF8_WARN_SURROGATE + && ckWARN2_d(WARN_UTF8, WARN_SURROGATE)) + { sv = sv_2mortal(Perl_newSVpvf(aTHX_ "UTF-16 surrogate U+%04"UVXf"", uv)); + pack_warn = packWARN2(WARN_UTF8, WARN_SURROGATE); } if (flags & UTF8_DISALLOW_SURROGATE) { goto disallowed; } } else if (UNICODE_IS_NONCHAR(uv)) { - if ((flags & (UTF8_WARN_NONCHAR|UTF8_CHECK_ONLY)) == UTF8_WARN_NONCHAR ) { + if ((flags & (UTF8_WARN_NONCHAR|UTF8_CHECK_ONLY)) == UTF8_WARN_NONCHAR + && ckWARN2_d(WARN_UTF8, WARN_NONCHAR)) + { sv = sv_2mortal(Perl_newSVpvf(aTHX_ "Unicode non-character U+%04"UVXf" is illegal for open interchange", uv)); + pack_warn = packWARN2(WARN_UTF8, WARN_NONCHAR); } if (flags & UTF8_DISALLOW_NONCHAR) { goto disallowed; } } else if ((uv > PERL_UNICODE_MAX)) { - if ((flags & (UTF8_WARN_SUPER|UTF8_CHECK_ONLY)) == UTF8_WARN_SUPER) { + if ((flags & (UTF8_WARN_SUPER|UTF8_CHECK_ONLY)) == UTF8_WARN_SUPER + && ckWARN2_d(WARN_UTF8, WARN_NON_UNICODE)) + { sv = sv_2mortal(Perl_newSVpvf(aTHX_ "Code point 0x%04"UVXf" is not Unicode, may not be portable", uv)); + pack_warn = packWARN2(WARN_UTF8, WARN_NON_UNICODE); } if (flags & UTF8_DISALLOW_SUPER) { goto disallowed; } } + if (sv) { + outlier_ret = uv; + goto do_warn; + } + /* Here, this is not considered a malformed character, so drop through * to return it */ } return uv; -disallowed: /* Is disallowed, but otherwise not malformed. 'sv' will have been - set if there is to be a warning. */ - if (!sv) { - dowarn = 0; - } + /* There are three cases which get to beyond this point. In all 3 cases: + * <sv> if not null points to a string to print as a warning. + * <curlen> is what <*retlen> should be set to if UTF8_CHECK_ONLY isn't + * set. + * <outlier_ret> is what return value to use if UTF8_CHECK_ONLY isn't set. + * This is done by initializing it to 0, and changing it only + * for case 1). + * The 3 cases are: + * 1) The input is valid but problematic, and to be warned about. The + * return value is the resultant code point; <*retlen> is set to + * <curlen>, the number of bytes that comprise the code point. + * <pack_warn> contains the result of packWARN() for the warning + * types. The entry point for this case is the label <do_warn>; + * 2) The input is a valid code point but disallowed by the parameters to + * this function. The return value is 0. If UTF8_CHECK_ONLY is set, + * <*relen> is -1; otherwise it is <curlen>, the number of bytes that + * comprise the code point. <pack_warn> contains the result of + * packWARN() for the warning types. The entry point for this case is + * the label <disallowed>. + * 3) The input is malformed. The return value is 0. If UTF8_CHECK_ONLY + * is set, <*relen> is -1; otherwise it is <curlen>, the number of + * bytes that comprise the malformation. All such malformations are + * assumed to be warning type <utf8>. The entry point for this case + * is the label <malformed>. + */ malformed: + if (sv && ckWARN_d(WARN_UTF8)) { + pack_warn = packWARN(WARN_UTF8); + } + +disallowed: + if (flags & UTF8_CHECK_ONLY) { if (retlen) *retlen = ((STRLEN) -1); return 0; } - if (dowarn) { - if (! sv) { - sv = newSVpvs_flags("Malformed UTF-8 character ", SVs_TEMP); - } +do_warn: - switch (warning) { - case 0: /* Intentionally empty. */ break; - case UTF8_WARN_EMPTY: - sv_catpvs(sv, "(empty string)"); - break; - case UTF8_WARN_CONTINUATION: - Perl_sv_catpvf(aTHX_ sv, "(unexpected continuation byte 0x%02"UVxf", with no preceding start byte)", uv); - break; - case UTF8_WARN_NON_CONTINUATION: - if (s == s0) - Perl_sv_catpvf(aTHX_ sv, "(unexpected non-continuation byte 0x%02"UVxf", immediately after start byte 0x%02"UVxf")", - (UV)s[1], startbyte); - else { - const int len = (int)(s-s0); - Perl_sv_catpvf(aTHX_ sv, "(unexpected non-continuation byte 0x%02"UVxf", %d byte%s after start byte 0x%02"UVxf", expected %d bytes)", - (UV)s[1], len, len > 1 ? "s" : "", startbyte, (int)expectlen); - } - - break; - case UTF8_WARN_SHORT: - Perl_sv_catpvf(aTHX_ sv, "(%d byte%s, need %d, after start byte 0x%02"UVxf")", - (int)curlen, curlen == 1 ? "" : "s", (int)expectlen, startbyte); - expectlen = curlen; /* distance for caller to skip */ - break; - case UTF8_WARN_OVERFLOW: - Perl_sv_catpvf(aTHX_ sv, "(overflow at 0x%"UVxf", byte 0x%02x, after start byte 0x%02"UVxf")", - ouv, *s, startbyte); - break; - case UTF8_WARN_LONG: - Perl_sv_catpvf(aTHX_ sv, "(%d byte%s, need %d, after start byte 0x%02"UVxf")", - (int)expectlen, expectlen == 1 ? "": "s", UNISKIP(uv), startbyte); - break; - default: - sv_catpvs(sv, "(unknown reason)"); - break; - } - - if (sv) { - const char * const s = SvPVX_const(sv); + if (pack_warn) { /* <pack_warn> was initialized to 0, and changed only + if warnings are to be raised. */ + const char * const string = SvPVX_const(sv); - if (PL_op) - Perl_warner(aTHX_ packWARN(WARN_UTF8), - "%s in %s", s, OP_DESC(PL_op)); - else - Perl_warner(aTHX_ packWARN(WARN_UTF8), "%s", s); - } + if (PL_op) + Perl_warner(aTHX_ pack_warn, "%s in %s", string, OP_DESC(PL_op)); + else + Perl_warner(aTHX_ pack_warn, "%s", string); } - if (retlen) - *retlen = expectlen ? expectlen : len; + if (retlen) { + *retlen = curlen; + } - return 0; + return outlier_ret; } /* @@ -799,10 +877,15 @@ malformed: Returns the native code point of the first character in the string C<s> which is assumed to be in UTF-8 encoding; C<send> points to 1 beyond the end of C<s>. -C<retlen> will be set to the length, in bytes, of that character. +C<*retlen> will be set to the length, in bytes, of that character. -If C<s> does not point to a well-formed UTF-8 character, zero is -returned and C<retlen> is set, if possible, to -1. +If C<s> does not point to a well-formed UTF-8 character and UTF8 warnings are +enabled, zero is returned and C<*retlen> is set (if C<retlen> isn't +NULL) to -1. If those warnings are off, the computed value if well-defined (or +the Unicode REPLACEMENT CHARACTER, if not) is silently returned, and C<*retlen> +is set (if C<retlen> isn't NULL) so that (S<C<s> + C<*retlen>>) is the +next possible position in C<s> that could begin a non-malformed character. +See L</utf8n_to_uvuni> for details on when the REPLACEMENT CHARACTER is returned. =cut */ @@ -845,8 +928,13 @@ Some, but not all, UTF-8 malformations are detected, and in fact, some malformed input could cause reading beyond the end of the input buffer, which is why this function is deprecated. Use L</utf8_to_uvchr_buf> instead. -If C<s> points to one of the detected malformations, zero is -returned and C<retlen> is set, if possible, to -1. +If C<s> points to one of the detected malformations, and UTF8 warnings are +enabled, zero is returned and C<*retlen> is set (if C<retlen> isn't +NULL) to -1. If those warnings are off, the computed value if well-defined (or +the Unicode REPLACEMENT CHARACTER, if not) is silently returned, and C<*retlen> +is set (if C<retlen> isn't NULL) so that (S<C<s> + C<*retlen>>) is the +next possible position in C<s> that could begin a non-malformed character. +See L</utf8n_to_uvuni> for details on when the REPLACEMENT CHARACTER is returned. =cut */ @@ -869,8 +957,13 @@ C<retlen> will be set to the length, in bytes, of that character. This function should only be used when the returned UV is considered an index into the Unicode semantic tables (e.g. swashes). -If C<s> does not point to a well-formed UTF-8 character, zero is -returned and C<retlen> is set, if possible, to -1. +If C<s> does not point to a well-formed UTF-8 character and UTF8 warnings are +enabled, zero is returned and C<*retlen> is set (if C<retlen> isn't +NULL) to -1. If those warnings are off, the computed value if well-defined (or +the Unicode REPLACEMENT CHARACTER, if not) is silently returned, and C<*retlen> +is set (if C<retlen> isn't NULL) so that (S<C<s> + C<*retlen>>) is the +next possible position in C<s> that could begin a non-malformed character. +See L</utf8n_to_uvuni> for details on when the REPLACEMENT CHARACTER is returned. =cut */ @@ -916,8 +1009,13 @@ Some, but not all, UTF-8 malformations are detected, and in fact, some malformed input could cause reading beyond the end of the input buffer, which is why this function is deprecated. Use L</utf8_to_uvuni_buf> instead. -If C<s> points to one of the detected malformations, zero is -returned and C<retlen> is set, if possible, to -1. +If C<s> points to one of the detected malformations, and UTF8 warnings are +enabled, zero is returned and C<*retlen> is set (if C<retlen> doesn't point to +NULL) to -1. If those warnings are off, the computed value if well-defined (or +the Unicode REPLACEMENT CHARACTER, if not) is silently returned, and C<*retlen> +is set (if C<retlen> isn't NULL) so that (S<C<s> + C<*retlen>>) is the +next possible position in C<s> that could begin a non-malformed character. +See L</utf8n_to_uvuni> for details on when the REPLACEMENT CHARACTER is returned. =cut */ diff --git a/utf8.h b/utf8.h index e558bb6..8b5700d 100644 --- a/utf8.h +++ b/utf8.h @@ -139,12 +139,12 @@ Perl's extended UTF-8 means we can have start bytes up to FF. */ #define UNI_IS_INVARIANT(c) (((UV)c) < 0x80) -/* Note that C0 and C1 are invalid in legal UTF8, so the lower bound of the - * below might ought to be C2 */ -#define UTF8_IS_START(c) (((U8)c) >= 0xc0) +#define UTF8_IS_START(c) (((U8)c) >= 0xc2) #define UTF8_IS_CONTINUATION(c) (((U8)c) >= 0x80 && (((U8)c) <= 0xbf)) #define UTF8_IS_CONTINUED(c) (((U8)c) & 0x80) -#define UTF8_IS_DOWNGRADEABLE_START(c) (((U8)c & 0xfc) == 0xc0) + +/* Masking with 0xfe allows low bit to be 0 or 1; thus this matches 0xc[23] */ +#define UTF8_IS_DOWNGRADEABLE_START(c) (((U8)c & 0xfe) == 0xc2) #define UTF_START_MARK(len) (((len) > 7) ? 0xFF : (0xFE << (7-(len)))) #define UTF_START_MASK(len) (((len) >= 7) ? 0x00 : (0x1F >> ((len)-2))) @@ -153,6 +153,12 @@ Perl's extended UTF-8 means we can have start bytes up to FF. #define UTF_ACCUMULATION_SHIFT 6 #define UTF_CONTINUATION_MASK ((U8)0x3f) +/* This sets the UTF_CONTINUATION_MASK in the upper bits of a word. If a value + * is anded with it, and the result is non-zero, then using the original value + * in UTF8_ACCUMULATE will overflow, shifting bits off the left */ +#define UTF_ACCUMULATION_OVERFLOW_MASK \ + (((UV) UTF_CONTINUATION_MASK) << ((sizeof(UV) * CHARBITS) - UTF_ACCUMULATION_SHIFT)) + #ifdef HAS_QUAD #define UNISKIP(uv) ( (uv) < 0x80 ? 1 : \ (uv) < 0x800 ? 2 : \ diff --git a/utfebcdic.h b/utfebcdic.h index aa3304b..eff444e 100644 --- a/utfebcdic.h +++ b/utfebcdic.h @@ -584,10 +584,11 @@ END_EXTERN_C #define UNI_IS_INVARIANT(c) ((c) < 0xA0) /* UTF-EBCDIC semantic macros - transform back into I8 and then compare */ -#define UTF8_IS_START(c) (NATIVE_TO_UTF(c) >= 0xA0 && (NATIVE_TO_UTF(c) & 0xE0) != 0xA0) + +#define UTF8_IS_START(c) (NATIVE_TO_UTF(c) >= 0xC5 && NATIVE_TO_UTF(c) != 0xE0) #define UTF8_IS_CONTINUATION(c) ((NATIVE_TO_UTF(c) & 0xE0) == 0xA0) #define UTF8_IS_CONTINUED(c) (NATIVE_TO_UTF(c) >= 0xA0) -#define UTF8_IS_DOWNGRADEABLE_START(c) (NATIVE_TO_UTF(c) >= 0xA0 && (NATIVE_TO_UTF(c) & 0xF8) == 0xC0) +#define UTF8_IS_DOWNGRADEABLE_START(c) (NATIVE_TO_UTF(c) >= 0xC5 && NATIVE_TO_UTF(c) <= 0xC7) #define UTF_START_MARK(len) (((len) > 7) ? 0xFF : ((U8)(0xFE << (7-(len))))) #define UTF_START_MASK(len) (((len) >= 6) ? 0x01 : (0x1F >> ((len)-2))) -- Perl5 Master Repository
