In perl.git, the branch smoke-me/khw-smoke has been created
<http://perl5.git.perl.org/perl.git/commitdiff/3573d6ac0bd9c3c349ac7158b577598e756bcc67?hp=0000000000000000000000000000000000000000>
at 3573d6ac0bd9c3c349ac7158b577598e756bcc67 (commit)
- Log -----------------------------------------------------------------
commit 3573d6ac0bd9c3c349ac7158b577598e756bcc67
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 22:14:15 2012 -0600
is_utf8_char_slow(): Avoid accepting overlongs
There are possible overlong sequences that this function blindly
accepts. Instead of developing the code to figure this out, turn this
function into a wrapper for utf8n_to_uvuni(), which already has this
check.
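The idea is that a full decode rejects overlongs for free. The sketch below is illustrative only, not Perl's actual code: `utf8_check_char` is a hypothetical stand-in that validates one UTF-8 sequence by checking the shortest-form rules directly, returning the sequence length or 0 on malformation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT Perl's API: validate one UTF-8 sequence,
 * rejecting overlongs.  Returns the sequence length, or 0 if the
 * bytes are malformed, truncated, or an overlong encoding. */
static size_t utf8_check_char(const unsigned char *s, size_t len)
{
    if (len == 0)
        return 0;
    if (s[0] < 0x80)                     /* 1-byte ASCII */
        return 1;
    if (s[0] < 0xC2)                     /* continuation, or 0xC0/0xC1:
                                            can only start overlongs */
        return 0;
    if (s[0] < 0xE0) {                   /* 2-byte sequence */
        if (len < 2 || (s[1] & 0xC0) != 0x80)
            return 0;
        return 2;
    }
    if (s[0] < 0xF0) {                   /* 3-byte sequence */
        if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return 0;
        if (s[0] == 0xE0 && s[1] < 0xA0) /* overlong: value < U+0800 */
            return 0;
        return 3;
    }
    if (s[0] < 0xF5) {                   /* 4-byte sequence */
        if (len < 4 || (s[1] & 0xC0) != 0x80
                    || (s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80)
            return 0;
        if (s[0] == 0xF0 && s[1] < 0x90) /* overlong: value < U+10000 */
            return 0;
        if (s[0] == 0xF4 && s[1] > 0x8F) /* beyond U+10FFFF */
            return 0;
        return 4;
    }
    return 0;
}
```

For example, "\xC0\x80" (an overlong encoding of NUL) is rejected here, where a naive length-and-continuation check would accept it.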
M utf8.c
commit 357463d0529417b16487f3fa97aca4f3a4e7287a
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 18:32:57 2012 -0600
perlapi: Update for changes in utf8 decoding
M utf8.c
commit 8d338e466a95aa08242140e80291dc4127ea9e7b
Author: Karl Williamson <[email protected]>
Date: Mon Apr 23 13:28:32 2012 -0600
utf8.c: White-space only
This outdents to account for the removal of a surrounding block.
M utf8.c
commit 346c2b9e7b032aee87ad1744f38be11ff98bf3b7
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 17:36:01 2012 -0600
utf8.c: refactor utf8n_to_uvuni()
The prior version had a number of issues; some have been taken care
of in previous commits, and some are left to do.
The goal when presented with malformed input is to consume as few bytes
as possible so as to position the input for the next try to the first
possible byte that could be the beginning of a character. We don't want
to consume too few bytes, so that the next call has us thinking that
what is the middle of a character is really the beginning; nor do we
want to consume too many, so as to skip valid input characters.
The previous code could do both of these in various circumstances.
In some cases it believed that the first byte in a character was correct,
and skipped looking at the rest of the bytes in the sequence. This is
wrong when just that first byte is garbled. We have to look at all
bytes in the expected sequence to make sure it hasn't been prematurely
terminated.
Likewise when we get an overflow: we have to keep looking at each byte
in the alleged sequence. It may be that the initial byte was garbled to
give us an apparent large number, but the actual sequence is shorter
than expected, and there really wouldn't have been an overflow. We
want to position the pointer for the next call to be the beginning of
the next potentially good character.
This fixes a long-standing TODO from an externally supplied utf8 decode
test suite.
It is unclear that the old algorithm for finding overflow catches all
such cases. This now uses an algorithm suggested by Hugo van der Sanden
that should work in all instances.
Another bug is that the code was careless about what happens when an
allowed malformation occurs. For example, a sequence should not start
with a continuation byte. If that malformation is allowed, the code
pretends it is a start byte and extracts the length of the sequence from
that. But pretending it is a start byte is not the same thing as it
being a start byte, and that extracted length is bogus.
Yet another bug fixed is that the utf8 warning category had to have been
turned on to get warnings that should have been raised when only the
surrogate, non_unicode, or nonchar categories were on.
And yet another change is that, given malformed input with no warnings,
this function used to return whatever it had computed so far. But this
is really invalid garbage. Return the REPLACEMENT CHARACTER instead.
Thanks to Hugo van der Sanden for reviewing and finding problems with an
earlier version of these commits.
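The "consume as few bytes as possible, but not too few" rule described above can be sketched as follows. This is a simplified illustration, not the committed code: on a malformation, advance past the lead byte plus only the continuation bytes that are actually present, capped at the length the lead byte claimed, so the next decode attempt starts at the first byte that could begin a new character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: how many bytes of a malformed sequence to skip.
 * Consume the lead byte plus any immediately following continuation
 * bytes, but never more than the lead byte's claimed length and never
 * more than the buffer holds.  Stopping at the first non-continuation
 * byte avoids eating a following valid character; consuming all the
 * present continuations avoids re-treating mid-character bytes as a
 * start byte on the next call. */
static size_t utf8_skip_malformed(const unsigned char *s, size_t len)
{
    size_t expect, i;

    if (len == 0)
        return 0;
    /* Expected length from the lead byte (simplified lookup). */
    if      (s[0] < 0xC0) expect = 1;   /* ASCII, or bare continuation */
    else if (s[0] < 0xE0) expect = 2;
    else if (s[0] < 0xF0) expect = 3;
    else                  expect = 4;
    if (expect > len)
        expect = len;
    /* Consume continuation bytes only while they are really there. */
    for (i = 1; i < expect; i++)
        if ((s[i] & 0xC0) != 0x80)
            break;
    return i;
}
```

So a garbled lead byte followed by an ASCII byte skips one byte only, leaving the ASCII character intact for the next call, while a truncated long sequence skips every continuation byte it did get.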
M Porting/perl5160delta.pod
M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t
M t/op/utf8decode.t
M utf8.c
M utf8.h
commit 98bfaebc907eddb984f38ad9b0141efc2c5baf7f
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 16:48:29 2012 -0600
utf8n_to_uvuni: Avoid reading outside of buffer
Prior to this patch, if the first byte of a UTF-8 sequence indicated
that the sequence occupied n bytes, but the input parameters indicated
that fewer were available, the function nevertheless attempted to read
all n bytes, running past the end of the buffer.
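The guard amounts to comparing the claimed length against the supplied length before touching any continuation byte. A minimal sketch, with a hypothetical helper name (not Perl's function):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the fix: if the lead byte claims an n-byte
 * sequence but only curlen bytes were supplied, stop and report the
 * sequence as short instead of reading past the buffer end.
 * Returns 1 if the sequence is complete, 0 otherwise. */
static int utf8_sequence_complete(const unsigned char *s, size_t curlen)
{
    size_t expect, i;

    if (curlen == 0 || (s[0] & 0xC0) == 0x80)
        return 0;                       /* empty, or bare continuation */
    if      (s[0] < 0x80) expect = 1;
    else if (s[0] < 0xE0) expect = 2;
    else if (s[0] < 0xF0) expect = 3;
    else                  expect = 4;
    if (expect > curlen)                /* the bug: old code read on anyway */
        return 0;
    for (i = 1; i < expect; i++)        /* safe: all expect bytes available */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    return 1;
}
```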
M utf8.c
commit 6901a157209bea699058486a6890f4294cddcf92
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 16:35:39 2012 -0600
utf8.c: Clarify and correct pod
Some of these were spotted by Hugo van der Sanden
M utf8.c
commit 0a28edba0bc3c8ddc9a4731c0ce3b06ccd760623
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 16:20:22 2012 -0600
utf8.c: Use macros instead of if..else.. sequence
There are two existing macros that do the job that this longish sequence
does. One, UTF8SKIP(), does an array lookup and is very likely to be in
the machine's cache as it is used ubiquitously when processing UTF-8.
The other is a simple test and shift. These simplify the code and
should speed things up as well.
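Both shapes the commit mentions can be illustrated briefly. These are simplified stand-ins, not the UTF8SKIP macro itself: a 256-entry table indexed by the lead byte, and a test-and-shift loop that counts the lead byte's high one-bits.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for an UTF8SKIP-style table: one entry per
 * possible lead byte giving the expected sequence length, so a single
 * array lookup replaces a chain of if..else range tests. */
static unsigned char skip_tab[256];

static void init_skip_tab(void)
{
    int b;
    for (b = 0; b < 256; b++) {
        if      (b < 0xC0) skip_tab[b] = 1;  /* ASCII; continuations as 1 */
        else if (b < 0xE0) skip_tab[b] = 2;
        else if (b < 0xF0) skip_tab[b] = 3;
        else if (b < 0xF8) skip_tab[b] = 4;
        else if (b < 0xFC) skip_tab[b] = 5;
        else               skip_tab[b] = 6;
    }
}

/* The "simple test and shift" alternative: count leading one-bits. */
static size_t skip_by_shift(unsigned char b)
{
    size_t n = 0;
    while (b & 0x80) {
        n++;
        b = (unsigned char)(b << 1);
    }
    return n ? n : 1;                    /* ASCII is one byte */
}
```

The table lookup wins when it is hot in cache, which it usually is in code that processes UTF-8 constantly; the shift loop needs no memory access at all.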
M utf8.c
commit 1fd2908e3a9a48beda2d786ba87ff81f8c33d78f
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 15:25:28 2012 -0600
utf8.h: Use correct definition of start byte
The previous definition allowed for (illegal) overlongs. The uses of
this macro in the core assume that it is accurate. The inaccuracy can
cause such code to fail.
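The distinction can be shown with two illustrative macros (these are not Perl's actual definitions, and the strict upper bound here is the standard UTF-8 one, which Perl's extended range does not share): a byte with the 11xxxxxx bit pattern is not automatically a legal start byte, because 0xC0 and 0xC1 can only begin overlong sequences.

```c
#include <assert.h>

/* Illustrative only, NOT Perl's macros.  The loose test treats any
 * 11xxxxxx byte as a start byte; the strict test excludes 0xC0/0xC1,
 * which can only start (illegal) overlong sequences, and bytes above
 * the standard UTF-8 lead-byte range. */
#define IS_START_LOOSE(b)   (((b) & 0xC0) == 0xC0)
#define IS_START_STRICT(b)  ((b) >= 0xC2 && (b) <= 0xF4)
```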
M utf8.h
M utfebcdic.h
commit 16c0c791248794ee14fc967351111365eebcda06
Author: Christian Hansen <[email protected]>
Date: Wed Apr 18 14:32:16 2012 -0600
utf8.h: Use correct UTF-8 downgradeable definition
Previously, the macro changed by this commit would accept overlong
sequences.
This patch was changed by the committer to include EBCDIC changes
and, in the non-EBCDIC case, to save a test by using a mask instead,
in keeping with the prior version of the code.
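On ASCII platforms the fix comes down to this: a variant (non-ASCII) character fits in a single latin-1 byte only when its UTF-8 lead byte is 0xC2 or 0xC3, and accepting 0xC0 or 0xC1 lets overlongs through. The mask trick mentioned above works because 0xC2 and 0xC3 differ only in the low bit. The macro below is an illustrative reconstruction, not quoted from the patch:

```c
#include <assert.h>

/* Illustrative reconstruction for ASCII platforms: 0xC2 and 0xC3 are
 * the only lead bytes of characters downgradeable to a single latin-1
 * byte.  Masking off the low bit folds both into one comparison,
 * saving a test versus (b == 0xC2 || b == 0xC3). */
#define IS_DOWNGRADEABLE_START(b)  (((b) & 0xFE) == 0xC2)
```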
M AUTHORS
M t/op/print.t
M utf8.h
M utfebcdic.h
commit 3ee07b2fb21b981e0906c98a54cbdeb468641bf5
Author: Brian Fraser <[email protected]>
Date: Fri Apr 20 22:09:56 2012 -0300
Make unicode label tests use unicode_eval.
A recent change exposed a faulty test in t/uni/labels.t.
Previously, a downgraded label passed to eval under 'use utf8;'
would have been erroneously considered UTF-8 and the tests
would pass. Now it is correctly reported as illegal UTF-8
unless unicode_eval is in effect.
M t/uni/labels.t
-----------------------------------------------------------------------
--
Perl5 Master Repository