In perl.git, the branch smoke-me/khw-smoke has been created
<http://perl5.git.perl.org/perl.git/commitdiff/cd7e6c884f038d4463b1c4768533b484e5c5c919?hp=0000000000000000000000000000000000000000>
at cd7e6c884f038d4463b1c4768533b484e5c5c919 (commit)
- Log -----------------------------------------------------------------
commit cd7e6c884f038d4463b1c4768533b484e5c5c919
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 22:14:15 2012 -0600
is_utf8_char_slow(): Avoid accepting overlongs
There are possible overlong sequences that this function blindly
accepts. Rather than duplicating the logic needed to detect them, turn
this function into a wrapper for utf8n_to_uvuni(), which already
performs this check (sketched below).
M utf8.c
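A minimal sketch of the wrapper idea, not the perl source: the toy
strict_decode() below is a stand-in for utf8n_to_uvuni() and handles only
ASCII and two-byte sequences, but, like the real function, it rejects
overlongs, so the validity check can simply delegate to it instead of
re-deriving the overlong rules.

    #include <stddef.h>
    #include <stdint.h>

    #define DECODE_ERROR 0xFFFFFFFFu

    /* Toy strict decoder: returns the code point and sets *len, or returns
     * DECODE_ERROR on malformed input, including overlong two-byte forms. */
    static uint32_t strict_decode(const uint8_t *s, size_t curlen, size_t *len)
    {
        if (curlen >= 1 && s[0] < 0x80) {              /* ASCII */
            *len = 1;
            return s[0];
        }
        if (curlen >= 2 && (s[0] & 0xE0) == 0xC0       /* two-byte start */
                        && (s[1] & 0xC0) == 0x80) {    /* continuation */
            uint32_t cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            *len = 2;
            return cp < 0x80 ? DECODE_ERROR : cp;      /* reject overlongs */
        }
        return DECODE_ERROR;                           /* longer forms elided */
    }

    /* Wrapper in the spirit of the commit: answer "is this a valid
     * character?" by asking the strict decoder, returning its length or 0. */
    size_t is_valid_char_sketch(const uint8_t *s, size_t curlen)
    {
        size_t len = 0;
        return strict_decode(s, curlen, &len) == DECODE_ERROR ? 0 : len;
    }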
commit 524080c4d32ea2975130ce2ce31f3b3d508bf140
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 18:32:57 2012 -0600
perlapi: Update for changes in utf8 decoding
M utf8.c
commit f555bc63534ca05176eb37540a96c0e644dbadff
Author: Karl Williamson <[email protected]>
Date: Mon Apr 23 13:28:32 2012 -0600
utf8.c: White-space only
This outdents to account for the removal of a surrounding block.
M utf8.c
commit eb83ed87110e41de6a4cd4463f75df60798a9243
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 17:36:01 2012 -0600
utf8.c: refactor utf8n_to_uvuni()
The prior version had a number of issues, some of which have been taken
care of in previous commits.
The goal when presented with malformed input is to consume as few bytes
as possible, so as to position the input for the next try at the first
possible byte that could begin a character. We don't want to consume
too few bytes, or the next call will mistake the middle of a character
for the beginning of one; nor do we want to consume too many and skip
valid input characters. (The latter is forbidden by the Unicode
standard because of security considerations.) The previous code could
do both of these under various circumstances.
In some cases it took it as a given that the first byte of a character
was correct, and skipped looking at the rest of the bytes in the
sequence. This is wrong when just that first byte is garbled: we have
to look at every byte in the expected sequence to make sure it hasn't
been terminated earlier than that first byte led us to expect.
Likewise when we get an overflow: we have to keep looking at each byte
in the sequence. It may be that the initial byte was garbled, so that
it appeared that there was going to be overflow, but in reality, the
input was supposed to be a shorter sequence that doesn't overflow. We
want to have an error on that shorter sequence, and advance the pointer
to just beyond it, which is the first position where a valid character
could start.
This fixes a long-standing TODO from an externally supplied utf8 decode
test suite.
And the old algorithm for finding overflow failed to detect it on some
inputs. This was spotted by Hugo van der Sanden, who suggested the new
algorithm that this commit uses, which should work in all instances.
For example, on a 32-bit machine, any string beginning with "\xFE" and
having "\x86" or "\x87" as the next byte overflows, but this was missed
by the old algorithm (an illustrative sketch follows this commit entry).
Another bug was that the code was careless about what happens when a
malformation occurs that the input flags allow. For example, a sequence
should not start with a continuation byte. If that malformation was
allowed, the code pretended the continuation byte was a start byte and
extracted the "length" of the sequence from it. But pretending it is a
start byte is not the same thing as it actually being one: there is no
extractable length in it, so the number this code took for the "length"
was bogus.
Yet another bug fixed is that if only the warning subcategories of the
utf8 category were turned on, and not the entire utf8 category itself,
warnings that should have been raised were not.
And yet another change is that, given malformed input with warnings
turned off, this function used to return whatever it had computed so
far, which was incomplete or erroneous garbage. This commit changes it
to return the REPLACEMENT CHARACTER instead.
Thanks to Hugo van der Sanden for reviewing and finding problems with an
earlier version of these commits.
M Porting/perl5160delta.pod
M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t
M t/op/utf8decode.t
M utf8.c
M utf8.h
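As an illustration only, not the committed code: one robust way to detect
overflow independently of the start byte is to guard each 6-bit
accumulation step, as below. UV_MAX_SKETCH and accumulate() are
inventions of this sketch, standing in for a 32-bit UV and the decoder's
inner loop.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define UV_MAX_SKETCH UINT32_MAX           /* pretend UVs are 32 bits */

    /* Add one continuation byte's 6 payload bits to uv, refusing to do so
     * if the shift would lose high bits, i.e. if the value overflows. */
    static bool accumulate(uint32_t *uv, uint8_t continuation)
    {
        if (*uv > (UV_MAX_SKETCH >> 6))
            return false;                      /* would overflow */
        *uv = (*uv << 6) | (continuation & 0x3F);
        return true;
    }

    int main(void)
    {
        /* "\xFE\x87..." begins a seven-byte sequence whose value cannot
         * fit in 32 bits; the guard reports overflow at the byte where
         * the accumulated value first becomes too large. */
        const uint8_t s[] = { 0xFE, 0x87, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF };
        uint32_t uv = 0;                       /* 0xFE carries no payload bits */
        for (size_t i = 1; i < sizeof s; i++) {
            if (!accumulate(&uv, s[i])) {
                printf("overflow detected at byte %zu\n", i);
                return 0;
            }
        }
        printf("value = %u\n", (unsigned)uv);
        return 0;
    }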
commit 0b8d30e8ba4bed9219a0a08549fd9d07661587ee
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 16:48:29 2012 -0600
utf8n_to_uvuni: Avoid reading outside of buffer
Prior to this patch, if the first byte of a UTF-8 sequence indicated
that the sequence occupied n bytes, but the input parameters indicated
that fewer were available, an attempt was made to read all n bytes
anyway, running past the end of the buffer (see the sketch below).
M utf8.c
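A minimal sketch of the guard being described, not the perl code;
expected_len_from_start_byte() is a simplified stand-in for a
UTF8SKIP()-style lookup and covers standard UTF-8 lengths only.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* Length the start byte claims the sequence has (standard UTF-8 only;
     * continuation bytes and perl's longer extended forms are elided). */
    static size_t expected_len_from_start_byte(uint8_t b)
    {
        if (b < 0xC0) return 1;     /* ASCII or a stray continuation byte */
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        if (b < 0xF8) return 4;
        return 5;                   /* longer (extended) forms elided */
    }

    /* Report a truncated sequence instead of reading s[curlen] and beyond. */
    bool sequence_is_complete(const uint8_t *s, size_t curlen, size_t *expectlen)
    {
        if (curlen == 0) {
            *expectlen = 0;
            return false;
        }
        *expectlen = expected_len_from_start_byte(s[0]);
        return curlen >= *expectlen;
    }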
commit 746afd533cc96b75c8a3c821291822f0c0ce7e2a
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 16:35:39 2012 -0600
utf8.c: Clarify and correct pod
Some of these were spotted by Hugo van der Sanden
M utf8.c
commit 99ee1dcd0469086e91a96e31a9b9ea27bb7f0c7e
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 16:20:22 2012 -0600
utf8.c: Use macros instead of if..else.. sequence
There are two existing macros that do the job of this longish sequence.
One, UTF8SKIP(), does an array lookup and is very likely to be in the
machine's cache, as it is used ubiquitously when processing UTF-8. The
other is a simple test and shift. These simplify the code and should
speed things up as well (the lookup-table technique is sketched below).
M utf8.c
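An illustrative version of the lookup-table technique, not perl's actual
UTF8SKIP() (whose table is precomputed and also covers the extended
0xFE/0xFF forms): one array index on the first byte replaces a chain of
if/else range tests.

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t skip_table[256];

    /* Filled at startup for brevity; the real table is a static constant. */
    static void init_skip_table(void)
    {
        for (int b = 0; b < 256; b++) {
            if      (b < 0xC0) skip_table[b] = 1;  /* ASCII and continuations */
            else if (b < 0xE0) skip_table[b] = 2;
            else if (b < 0xF0) skip_table[b] = 3;
            else if (b < 0xF8) skip_table[b] = 4;
            else if (b < 0xFC) skip_table[b] = 5;
            else if (b < 0xFE) skip_table[b] = 6;
            else               skip_table[b] = 7;  /* 0xFE; 0xFF is longer in perl */
        }
    }

    #define SKIP_SKETCH(s) (skip_table[(uint8_t)*(s)])

    int main(void)
    {
        init_skip_table();
        const char *snowman = "\xE2\x98\x83";      /* U+2603 SNOWMAN */
        printf("%d\n", SKIP_SKETCH(snowman));      /* prints 3 */
        return 0;
    }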
commit 0447e8df3db3f566f76a613f62c5f4cdd7262997
Author: Karl Williamson <[email protected]>
Date: Wed Apr 18 15:25:28 2012 -0600
utf8.h: Use correct definition of start byte
The previous definition allowed for (illegal) overlongs. The uses of
this macro in the core assume that it is accurate, and the inaccuracy
can cause such code to fail (see the sketch below).
M utf8.h
M utfebcdic.h
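To illustrate the distinction in byte terms (this is a sketch, not the
macro from utf8.h or utfebcdic.h): 0xC0 and 0xC1 can only begin overlong
encodings of 0x00-0x7F, so a start-byte test that checks only the top
two bits is too loose.

    #include <stdint.h>
    #include <stdbool.h>

    /* Too loose: accepts 0xC0 and 0xC1, which can only start overlongs. */
    bool is_start_byte_loose(uint8_t b)
    {
        return (b & 0xC0) == 0xC0;     /* any of 0xC0-0xFF */
    }

    /* Stricter: a multi-byte character can only begin with 0xC2 or above
     * (standard UTF-8 additionally stops at 0xF4; perl's extended
     * encoding and the EBCDIC variants are not covered by this sketch). */
    bool is_start_byte_strict(uint8_t b)
    {
        return b >= 0xC2;
    }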
commit 0ae1fa71a437dfa435b139674610ec992366d661
Author: Christian Hansen <[email protected]>
Date: Wed Apr 18 14:32:16 2012 -0600
utf8.h: Use correct UTF-8 downgradeable definition
Previously, the macro changed by this commit would accept overlong
sequences.
This patch was changed by the committer to include EBCDIC changes and,
in the non-EBCDIC case, to save a test by using a mask instead, in
keeping with the prior version of the code (the mask trick is sketched
below).
M AUTHORS
M t/op/print.t
M utf8.h
M utfebcdic.h
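An illustration of the mask trick in ASCII (non-EBCDIC) terms, not the
actual macro: a two-byte sequence encodes a code point that fits in a
single byte (0x80-0xFF) exactly when its start byte is 0xC2 or 0xC3, and
masking off the low bit tests both with one compare while rejecting the
overlong starters 0xC0 and 0xC1.

    #include <stdint.h>
    #include <stdbool.h>

    /* True only for start bytes 0xC2 and 0xC3: both become 0xC2 after the
     * mask, whereas 0xC0 and 0xC1 become 0xC0 and are rejected. */
    bool starts_downgradeable_two_byte(uint8_t b)
    {
        return (b & 0xFE) == 0xC2;
    }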
commit dd53ca2f01b45dd5a54bd2d00709dbfbe00ccdf3
Author: Brian Fraser <[email protected]>
Date: Fri Apr 20 22:09:56 2012 -0300
Make unicode label tests use unicode_eval.
A recent change exposed a faulty test in t/uni/labels.t.
Previously, a downgraded label passed to eval under 'use utf8;'
would have been erroneously considered UTF-8 and the tests
would pass. Now it is correctly reported as illegal UTF-8
unless unicode_eval is in effect.
M t/uni/labels.t
-----------------------------------------------------------------------
--
Perl5 Master Repository