In perl.git, the branch smoke-me/khw-smoke has been created
<http://perl5.git.perl.org/perl.git/commitdiff/fbeb54abd56d9ffbefa5054de2451c5a5250ede4?hp=0000000000000000000000000000000000000000>
at fbeb54abd56d9ffbefa5054de2451c5a5250ede4 (commit)
- Log -----------------------------------------------------------------
commit fbeb54abd56d9ffbefa5054de2451c5a5250ede4
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 22:14:15 2012 -0600
is_utf8_char_slow(): Avoid accepting overlongs
There are possible overlong sequences that this function blindly
accepts. Instead of developing the code to figure this out, turn this
function into a wrapper for utf8n_to_uvuni() which already has this
check.
M utf8.c
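
For readers unfamiliar with the term, an overlong sequence encodes a code point in more bytes than it needs. The sketch below is not the code from this commit (which delegates to utf8n_to_uvuni()); it is a minimal illustration, assuming standard UTF-8 of at most 4 bytes, and sketch_is_overlong() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8.c: report whether a sequence of
 * 'len' bytes (start byte plus continuations) is an overlong encoding.
 * Assumes standard UTF-8 with sequences of at most 4 bytes. */
int sketch_is_overlong(const unsigned char *s, size_t len)
{
    /* Smallest code point that genuinely needs 1, 2, 3 or 4 bytes
     * (index 0 is unused). */
    static const uint32_t min_cp[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t i;

    if (len < 2 || len > 4)
        return 0;   /* single bytes cannot be overlong; longer forms are out of scope */

    /* Payload bits of the start byte: 5, 4 or 3 bits for lengths 2, 3, 4. */
    cp = s[0] & (0xFFu >> (len + 1));
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a continuation byte: malformed, but not overlong */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Overlong if the value would have fit in a shorter sequence. */
    return cp < min_cp[len];
}
```

For example, the bytes C0 80 decode to 0, which fits in one byte, so the sketch reports them as overlong; C2 80 decodes to 0x80 and is not.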
commit af66a93b9215965aa2f5a3ac65e7588b4add027b
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 18:32:57 2012 -0600
perlapi: Update for changes in utf8 decoding
M utf8.c
commit 2ea42982377368e9b207e924f89766f53203bae1
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 18:28:55 2012 -0600
utf8n_to_uvuni(): Return REPLACEMENT not garbage
Given malformed input with no warnings, this function used to return
whatever it had computed so far. But this is really invalid garbage.
Return the REPLACEMENT CHARACTER instead.
M Porting/perl5160delta.pod
M utf8.c
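
The idea of returning the REPLACEMENT CHARACTER rather than a partial result can be sketched as follows. This is not the utf8n_to_uvuni() implementation; sketch_decode_or_replacement() and SKETCH_REPLACEMENT are hypothetical names, the sketch assumes standard UTF-8 of at most 4 bytes, and it does not check for overlongs or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical decoder: returns the code point at 's', or U+FFFD if the
 * input is malformed.  It never returns a partially accumulated value. */
uint32_t sketch_decode_or_replacement(const unsigned char *s, size_t avail)
{
    uint32_t cp;
    size_t need, i;

    if (avail == 0)
        return SKETCH_REPLACEMENT;
    if (s[0] < 0x80)
        return s[0];                    /* plain ASCII */

    if      ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else
        return SKETCH_REPLACEMENT;      /* stray continuation or unsupported start byte */

    if (need > avail)
        return SKETCH_REPLACEMENT;      /* sequence is cut off by the end of the buffer */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return SKETCH_REPLACEMENT;  /* discard whatever was accumulated so far */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}
```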
commit 7e172ab44ff36d577a2e81e13533ce7d5c9ea018
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 17:36:01 2012 -0600
utf8.c: refactor utf8n_to_uvuni()
The prior version had a number of issues, some of which were taken
care of in previous commits, and some of which are left to do.
The goal when presented with malformed input is to consume as few bytes
as possible so as to position the input for the next try to the first
possible byte that could be the beginning of a character. We don't want
to consume too few bytes, so that the next call has us thinking that
what is the middle of a character is really the beginning; nor do we
want to consume too many, so as to skip valid input characters.
The previous code could do both of these in various circumstances.
In some cases it believed that the first byte in a character is correct,
and skipped looking at the rest of the bytes in the sequence. This is
wrong when just that first byte is garbled. We have to look at all
bytes in the expected sequence to make sure it hasn't been prematurely
terminated.
Likewise when we get an overflow: we have to keep looking at each byte
in the alleged sequence. It may be that the initial byte was garbled to
give us an apparent large number, but the actual sequence is shorter
than expected, and there really wouldn't have been an overflow. We
want to position the pointer for the next call to be the beginning of
the next potentially good character.
This fixes a long-standing TODO from an externally supplied utf8 decode
test suite.
M t/op/utf8decode.t
M utf8.c
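
The "consume as few bytes as possible" goal described in this commit can be illustrated with a small sketch. It is not the refactored utf8n_to_uvuni(); the names sketch_expected_len() and sketch_bytes_to_consume() are hypothetical, standard UTF-8 of at most 4 bytes is assumed, and treating a stray continuation byte as a one-byte unit is a simplification of this sketch.

```c
#include <stddef.h>

/* Hypothetical: length implied by a start byte (1 for anything that is
 * not a 2-, 3- or 4-byte start byte, including stray continuations). */
size_t sketch_expected_len(unsigned char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical: how many bytes the caller should skip so that the next
 * call starts at the first byte that could begin a character.  Every
 * byte of the alleged sequence is inspected; the scan stops early at
 * the first byte that is not a continuation, or at the buffer end. */
size_t sketch_bytes_to_consume(const unsigned char *s, size_t avail)
{
    size_t expected, used;

    if (avail == 0)
        return 0;
    expected = sketch_expected_len(s[0]);
    used = 1;
    while (used < expected && used < avail && (s[used] & 0xC0) == 0x80)
        used++;
    return used;
}
```

With the input E2 41 AC, the start byte claims three bytes, but the second byte is an ASCII 'A', so only one byte is consumed and the next call begins at the 'A' rather than skipping it.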
commit eeae0eadd4835b5851a0103a75bc4857bc960244
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 17:19:31 2012 -0600
utf8n_to_uvuni(): Move checking for >32 bit code points
This just moves the check for code points that are not portable to
32-bit machines to a later point in the function. These code points
aren't representable at all on EBCDIC platforms.
This will have the effect that on platforms where these aren't
representable, the error returned will be overflow instead of the
non-portable one, but the move fixes the problem where the first byte
is garbled and the input really isn't such a large code point. Prior
to this patch, the first byte was treated as gospel and the intervening
bytes weren't examined, leaving the pointer to the next input byte
advanced too far.
M utf8.c
M utf8.h
commit ca3d36aaa355f903345457017e73ad9fa5c7b1c0
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 16:48:29 2012 -0600
utf8n_to_uvuni: Avoid reading outside of buffer
Prior to this patch, if the first byte of a UTF-8 sequence indicated
that the sequence occupied n bytes, but the input parameters indicated
that fewer were available, the function attempted to read all n.
M utf8.c
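
A bounds check of the kind this commit describes might look like the sketch below. It is an illustration only, not the patched code; sketch_sequence_fits() is a hypothetical name and standard UTF-8 of at most 4 bytes is assumed. Only the start byte is read before the length is compared with the number of bytes available.

```c
#include <stddef.h>

/* Hypothetical: return 1 if the sequence announced by the start byte at
 * 's' lies entirely within the 'avail' bytes the caller says are
 * readable, 0 otherwise.  No byte beyond s[0] is touched. */
int sketch_sequence_fits(const unsigned char *s, size_t avail)
{
    size_t expected;

    if (avail == 0)
        return 0;                       /* not even the start byte is readable */

    if (s[0] < 0x80)                expected = 1;
    else if ((s[0] & 0xE0) == 0xC0) expected = 2;
    else if ((s[0] & 0xF0) == 0xE0) expected = 3;
    else if ((s[0] & 0xF8) == 0xF0) expected = 4;
    else
        return 0;                       /* not a start byte this sketch understands */

    return expected <= avail;
}
```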
commit 454b2235a0598c5c2cfc79e95b445bff1ecf84c2
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 16:35:39 2012 -0600
utf8.c: Clarify pod
M utf8.c
commit ae190ba2c06dd65c5475373b8666296149358e86
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 16:20:22 2012 -0600
utf8.c: Use macros instead of if..else.. sequence
There are two existing macros that do the job that this longish sequence
does. One, UTF8SKIP(), does an array lookup and is very likely to be in
the machine's cache as it is used ubiquitously when processing UTF-8.
The other is a simple test and shift. These simplify the code and
should speed things up as well.
M utf8.c
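
To show the general shape of replacing an if..else chain with a table lookup plus a test-and-shift, here is a sketch. The macros below are hypothetical stand-ins, not Perl's definitions: the real UTF8SKIP() uses Perl's own table, and the commit does not name the second macro. The sketch assumes standard UTF-8 of at most 4 bytes and indexes a 16-entry table by the high nibble, so bytes F8 to FF are (incorrectly for Perl's extended UTF-8) reported as length 4.

```c
/* Hypothetical table: sequence length implied by the high nibble of the
 * first byte.  0x0-0x7 are ASCII, 0x8-0xB are continuation bytes
 * (treated as length 1), 0xC-0xD start 2 bytes, 0xE 3 bytes, 0xF 4. */
static const unsigned char sketch_skip_by_nibble[16] = {
    1, 1, 1, 1, 1, 1, 1, 1,   /* 0x00-0x7F */
    1, 1, 1, 1,               /* 0x80-0xBF */
    2, 2,                     /* 0xC0-0xDF */
    3,                        /* 0xE0-0xEF */
    4                         /* 0xF0-0xFF */
};

/* Array lookup in place of an if..else chain on the first byte. */
#define SKETCH_SKIP(b) (sketch_skip_by_nibble[(unsigned char)(b) >> 4])

/* Simple test and shift: mask selecting the payload bits of a start
 * byte for a sequence of 'len' bytes. */
#define SKETCH_START_MASK(len) ((len) < 2 ? 0x7F : (0x1F >> ((len) - 2)))

/* Payload bits carried by the first byte of a sequence. */
unsigned char sketch_start_payload(unsigned char b)
{
    return (unsigned char)(b & SKETCH_START_MASK(SKETCH_SKIP(b)));
}
```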
commit a6f691c6222f5bb5100a1043aa6f8e8133224173
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 15:25:28 2012 -0600
utf8.h: Use correct definition of start byte
The previous definition allowed for (illegal) overlongs.
M utf8.h
M utfebcdic.h
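
Why a start-byte definition can admit overlongs is easiest to see on an ASCII platform: the bytes C0 and C1 can only begin two-byte encodings of code points below 0x80. The macros below are hypothetical illustrations of a loose and a strict test, not the definitions from utf8.h, and they ignore both EBCDIC and any upper limit on start bytes.

```c
/* Hypothetical loose test: any byte with the top two bits set.  This
 * accepts 0xC0 and 0xC1, which can only start overlong sequences. */
#define SKETCH_IS_START_LOOSE(b)  ((unsigned char)(b) >= 0xC0)

/* Hypothetical strict test: excludes 0xC0 and 0xC1. */
#define SKETCH_IS_START_STRICT(b) ((unsigned char)(b) >= 0xC2)
```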
commit ae0c490d71803ef2275f7f292e2a37ee011006ed
Author: Christian Hansen <[email protected]>
Date: Wed Apr 18 14:32:16 2012 -0600
utf8.h: Use correct UTF-8 downgradeable definition
Previously, the macro changed by this commit would accept overlong
sequences.
The committer changed the original patch to swap a mask instead of a
test, in keeping with the prior version of the code; and to include
EBCDIC changes.
M AUTHORS
M t/op/print.t
M utf8.h
M utfebcdic.h
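
A character is downgradeable when its code point is below 0x100, so that it fits in a single non-UTF-8 byte. On an ASCII platform the non-overlong two-byte encodings of 0x80 to 0xFF begin with C2 or C3. The sketch below contrasts a mask that also lets C0 and C1 through with one that does not; both macros are hypothetical names for illustration and are not claimed to be the text of the patch.

```c
/* Hypothetical loose mask: matches 0xC0-0xC3, so the overlong start
 * bytes 0xC0 and 0xC1 are accepted as well. */
#define SKETCH_DOWNGRADEABLE_START_LOOSE(b)  (((unsigned char)(b) & 0xFC) == 0xC0)

/* Hypothetical strict mask: matches only 0xC2 and 0xC3. */
#define SKETCH_DOWNGRADEABLE_START_STRICT(b) (((unsigned char)(b) & 0xFE) == 0xC2)
```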
commit 05cece3b654cc7408b5b7a1c1b16e1cc54d94c9c
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 14:07:33 2012 -0600
test.pl: Add fresh_perl_unlike()
M t/test.pl
-----------------------------------------------------------------------
--
Perl5 Master Repository