Branch: refs/heads/blead
  Home:   https://github.com/Perl/perl5
  Commit: a66489bb47fd2e66a5fb15dd3372a76fdc0c8352
      
https://github.com/Perl/perl5/commit/a66489bb47fd2e66a5fb15dd3372a76fdc0c8352
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-10-19 (Sat, 19 Oct 2024)

  Changed paths:
    M ext/XS-APItest/t/utf8_warn_base.pl

  Log Message:
  -----------
  APItest/t/utf8_warn_base: Change $variable name

'j' isn't very descriptive for this purpose


  Commit: ac3e75745847036d739523b0ad393c7cfb14aa5f
      
https://github.com/Perl/perl5/commit/ac3e75745847036d739523b0ad393c7cfb14aa5f
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-10-19 (Sat, 19 Oct 2024)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8.c: Add comments; move decls closer to use


  Commit: b015ed30e53236c4dac7d6061313d7b7a51aa911
      
https://github.com/Perl/perl5/commit/b015ed30e53236c4dac7d6061313d7b7a51aa911
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-10-19 (Sat, 19 Oct 2024)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8.c: Change #define name

I have a better use for this name in mind; and this makes the name more
specific


  Commit: 5718b263e61fd6938de8fc052585d16b0bb15324
      
https://github.com/Perl/perl5/commit/5718b263e61fd6938de8fc052585d16b0bb15324
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-10-19 (Sat, 19 Oct 2024)

  Changed paths:
    M embed.fnc
    M proto.h
    M t/op/utf8decode.t
    M utf8.c

  Log Message:
  -----------
  Improve UTF-8 overflow/overlong handling

Perl's extended UTF-8 is capable of representing code points up to 2**72
(2**65 on EBCDIC).  These won't fit in a 64 bit word, and hence
overflow.  (And much more so on 32 bit machines.)  A start byte of \xFF
is required for code points starting with 2**36; \xFE for those starting
with 2**31, and so on.  But it turns out that a sequence beginning with
\xFE can express all code points 0..2**36-1, and \xFF sequences can
express everything 0..2**72-1.
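
The ranges above follow from counting payload bits.  As a rough sketch
(not code from utf8.c; the function names are hypothetical, and the
13-byte length of a \xFF sequence on ASCII platforms is assumed from the
2**72 figure above, i.e. 12 continuation bytes of 6 bits each):

```c
#include <assert.h>
#include <stdint.h>

/* Length in bytes of the sequence introduced by `start` on an ASCII
 * platform, or 0 if `start` is a continuation byte (0x80..0xBF).
 * The 13-byte length for 0xFF is an assumption of this sketch. */
static unsigned seq_len_from_start(uint8_t start)
{
    if (start < 0x80) return 1;
    if (start < 0xC0) return 0;
    if (start < 0xE0) return 2;
    if (start < 0xF0) return 3;
    if (start < 0xF8) return 4;
    if (start < 0xFC) return 5;
    if (start < 0xFE) return 6;
    return (start == 0xFE) ? 7 : 13;
}

/* Number of payload bits a sequence of `len` bytes can carry: each
 * continuation byte holds 6, and the start byte holds whatever is left
 * after its length marker (nothing at all for 0xFE and 0xFF). */
static unsigned payload_bits(unsigned len)
{
    assert(len >= 1 && len <= 13);
    if (len == 1) return 7;
    unsigned start_bits = (len < 7) ? 7 - len : 0;
    return start_bits + 6 * (len - 1);
}
```

With these, payload_bits(seq_len_from_start(0xFE)) gives 36 and the
\xFF case gives 72, which is more than a 64 bit word can hold.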

When a sequence represents a code point that can be expressed by a
shorter sequence, it is called an overlong, and using those is expressly
forbidden by the Unicode standard due to spoofing attacks that have
occurred.  So, a \xFE start byte should only be used for code points
in the range 2**31..2**36-1; and \xFF only for 2**36..2**72-1.  But
Perl needs to handle the possibility where the input doesn't match the
expectations of what it should be.
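
To make the overlong rule concrete, here is a minimal sketch for
complete, otherwise well-formed sequences on an ASCII platform.  It is
not the utf8.c implementation; min_for_len and is_overlong are
hypothetical names, and the 13-byte \xFF form is assumed as above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest code point that genuinely needs `len` bytes.  Anything
 * lower fits in a shorter sequence. */
static uint64_t min_for_len(size_t len)
{
    switch (len) {
    case 1:  return 0;
    case 2:  return UINT64_C(1) << 7;
    case 3:  return UINT64_C(1) << 11;
    case 4:  return UINT64_C(1) << 16;
    case 5:  return UINT64_C(1) << 21;
    case 6:  return UINT64_C(1) << 26;
    case 7:  return UINT64_C(1) << 31;   /* 0xFE start byte */
    default: return UINT64_C(1) << 36;   /* 0xFF start byte */
    }
}

/* True if the complete sequence s[0..len-1] encodes a code point that
 * a shorter sequence could have expressed. */
static bool is_overlong(const uint8_t *s, size_t len)
{
    assert(s != NULL && len >= 1 && (len <= 7 || len == 13));

    if (len == 1) return false;

    if (len == 13) {
        /* The 12 continuation bytes carry 72 bits.  The value is below
         * 2**36 exactly when the first six carry only zero bits, so
         * there is no need to decode (and overflow) the whole thing. */
        for (size_t i = 1; i <= 6; i++)
            if (s[i] != 0x80) return false;
        return true;
    }

    /* At most 36 payload bits here, so the value fits in 64 bits. */
    uint64_t cp = s[0] & (0x7Fu >> len);
    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3Fu);
    return cp < min_for_len(len);
}
```

For example, "\xC0\x80" (a two-byte NUL) is reported as overlong, while
"\xFE\x82\x80\x80\x80\x80\x80" (2**31) is not.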

We have tried to determine all the malformations that apply to a given
sequence and return them to the caller when requested.  The interplay
between overflow and overlong is somewhat tricky, and the new tests that
are to be added in the next commit showed that we haven't been doing it
completely right.

Prior to this commit, the checks for both overlong and overflow had
three states: yes, no, and maybe.  The last meant that the sequence
being examined was shorter than a full character, and that some possible
completions of it would result in yes, and some would result in no.

This commit retains the tripartite state of examining a sequence for
being overlong, but adds a fourth state for overflow, namely that the
input overflows unless the sequence is overlong, and there aren't enough
bytes to determine the latter for certain.  But overlongs are rare, so
the chance of the sequence being one is tiny, and this state means that
it almost certainly overflows.
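
One way to picture the four states is the following simplified sketch.
It is an illustration only, not the logic in utf8.c: the enum and
function names are hypothetical, it handles only \xFF-initial sequences
on an ASCII platform, and it checks against a 32 bit word, where every
non-overlong \xFF sequence (2**36 and up) overflows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    OVERFLOW_NO,              /* fits in the word */
    OVERFLOW_YES,             /* cannot fit, whatever follows */
    OVERFLOW_MAYBE,           /* depends on bytes not yet seen */
    OVERFLOW_UNLESS_OVERLONG  /* overflows unless the full sequence
                                 turns out to be overlong; the input is
                                 too short to tell */
} overflow_state;

/* Classify s[0..avail-1], a possibly truncated 13-byte sequence that
 * starts with 0xFF and otherwise holds continuation bytes, against a
 * 32-bit word. */
static overflow_state ff_overflows_32(const uint8_t *s, size_t avail)
{
    assert(s != NULL && avail >= 1 && avail <= 13 && s[0] == 0xFF);

    /* Continuation bytes 1..6 hold payload bits 36..71.  Any set bit
     * there rules out overlong and puts the value at 2**36 or above. */
    for (size_t i = 1; i <= 6; i++) {
        if (i >= avail) return OVERFLOW_UNLESS_OVERLONG;
        if (s[i] != 0x80) return OVERFLOW_YES;
    }

    /* Known to be overlong now (value below 2**36).  Byte 7 holds bits
     * 30..35; bits 32..35 must be clear for the value to fit. */
    if (avail < 8) return OVERFLOW_MAYBE;
    return (s[7] & 0x3C) ? OVERFLOW_YES : OVERFLOW_NO;
}
```

A lone "\xFF" yields OVERFLOW_UNLESS_OVERLONG, whereas "\xFF\x81" is
already a definite overflow because it can no longer be overlong.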

Prior to this commit, I had tried to cope with some of this by an extra
parameter to the find-if-overflow function, but this fourth state
removes the need for that.  The caller gets which state the input is,
and then chooses how to handle it, without needing the parameter.

The tests in utf8decode.t also had to be changed, as this new code picks
up some overflows on 32-bit machines that were previously not caught.


  Commit: b2215f685caf853ae0132e38a53f244a958ec930
      
https://github.com/Perl/perl5/commit/b2215f685caf853ae0132e38a53f244a958ec930
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-10-19 (Sat, 19 Oct 2024)

  Changed paths:
    M ext/XS-APItest/t/utf8_warn_base.pl

  Log Message:
  -----------
  APItest/t/utf8_warn_base: Add tests

One UTF-8 malformation is when the string has a start byte in it before
the expected end of the character.  This test file tested the case where
the unexpected byte came in the final position.  GH #22597 found bugs
where the unexpected byte came immediately after the first byte.

This commit adds tests for unexpected bytes in all possible positions.
If the fix for GH #22597 is reverted, this new revised file has 1400
failures.
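
The idea of probing every position can be sketched in C as follows.
This is only an illustration of the strategy, not the Perl test code in
utf8_warn_base.pl; first_unexpected and count_missed are hypothetical
helpers, and the scan stands in for whatever decoder is under test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static bool is_continuation(uint8_t b) { return (b & 0xC0) == 0x80; }

/* Index of the first byte after the start byte that is not a
 * continuation byte, or `len` if every such byte is one. */
static size_t first_unexpected(const uint8_t *s, size_t len)
{
    assert(s != NULL && len >= 1);
    for (size_t i = 1; i < len; i++)
        if (!is_continuation(s[i])) return i;
    return len;
}

/* Plant `intruder` at each non-initial position of a copy of `good` in
 * turn, and count the positions where the scan fails to flag it. */
static size_t count_missed(const uint8_t *good, size_t len, uint8_t intruder)
{
    uint8_t buf[13];
    size_t missed = 0;

    assert(good != NULL && len >= 1 && len <= sizeof buf);
    for (size_t pos = 1; pos < len; pos++) {
        memcpy(buf, good, len);
        buf[pos] = intruder;
        if (first_unexpected(buf, len) != pos) missed++;
    }
    return missed;
}
```

A check that only exercised the final position would be the single
iteration pos == len - 1; looping over every pos also covers the case
immediately after the first byte.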


Compare: https://github.com/Perl/perl5/compare/9621dfa82225...b2215f685caf

To unsubscribe from these emails, change your notification settings at 
https://github.com/Perl/perl5/settings/notifications
