Branch: refs/heads/blead
Home: https://github.com/Perl/perl5
Commit: 61405b774c1372b18658a22e8b0f05df1456e676
https://github.com/Perl/perl5/commit/61405b774c1372b18658a22e8b0f05df1456e676
Author: Karl Williamson <[email protected]>
Date: 2024-01-01 (Mon, 01 Jan 2024)
Changed paths:
M t/op/tr.t
M toke.c
Log Message:
-----------
Fix tr/\N{latin1}...\N{above latin1}/
When a string is being parsed, it isn't made UTF-8 until necessary; that
is, until the parser first finds a character that requires UTF-8 to
represent. If all the characters prior to that one are ASCII, all that
is needed is to convert that one to UTF-8 and to turn on the UTF-8 flag,
so that all future characters encountered in the parse will be
represented in UTF-8. This is because all ASCII characters have the same
representation whether or not the string is UTF-8; they are "UTF-8
invariant". But if a UTF-8 *variant* character was in the string prior
to the UTF-8-required one, it must be converted to its UTF-8
representation when the string is converted. To tell whether any such
conversion is needed, it suffices to increment a count of variant
characters as the parse proceeds.
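The invariant/variant distinction described above can be illustrated
with a small Python sketch. It is not Perl's implementation: the
function name is hypothetical, Latin-1 stands in for the non-UTF-8
representation, and an ASCII (not EBCDIC) platform is assumed.

```python
def is_utf8_invariant(code_point: int) -> bool:
    """Hypothetical helper: True if a code point in 0..255 has the same
    bytes in the single-byte (Latin-1) form and in UTF-8.
    Assumes an ASCII platform."""
    ch = chr(code_point)
    return ch.encode("latin-1") == ch.encode("utf-8")


# 'A' (0x41) is one identical byte either way: invariant.
print(is_utf8_invariant(0x41))
# U+00E9 is one byte in Latin-1 but two bytes in UTF-8: variant.
print(is_utf8_invariant(0xE9))
# Number of invariant code points among the 256 single-byte values.
print(sum(is_utf8_invariant(cp) for cp in range(256)))
```

In this model exactly the ASCII range is invariant, which is why a
string holding only ASCII can have its UTF-8 flag turned on without
touching the bytes already parsed.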
If nothing in the string requires UTF-8 by the end of the parse, the
count is ignored and the string remains non-UTF-8.
And if the count is zero when a UTF-8-required character is found, as
mentioned above, that character is converted to UTF-8, and the flag is
set to use UTF-8 going forward.
But a non-zero count at the first UTF-8-required character indicates
that before proceeding, the already-parsed string must be reparsed to
convert the variant characters already in it to UTF-8.
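The three cases above (nothing requires UTF-8, zero count, non-zero
count) can be sketched as follows. This is a simplified Python model
under stated assumptions, not the code in toke.c: build_string and its
variable names are hypothetical, input is a list of code points, and
an ASCII platform is assumed.

```python
def build_string(code_points):
    """Hypothetical model of the deferred upgrade to UTF-8.
    Returns (bytes, utf8_flag)."""
    buf = bytearray()
    is_utf8 = False
    variant_count = 0  # variant characters stored so far as single bytes

    for cp in code_points:
        if is_utf8:
            # Flag already on: everything is stored as UTF-8 from here.
            buf += chr(cp).encode("utf-8")
        elif cp <= 0xFF:
            # Fits in one byte; store as-is, but remember variants.
            buf.append(cp)
            if cp >= 0x80:  # assumption: ASCII platform
                variant_count += 1
        else:
            # First character that requires UTF-8.
            if variant_count:
                # Non-zero count: redo what was already stored so the
                # variant bytes become their UTF-8 sequences.
                buf = bytearray(bytes(buf).decode("latin-1").encode("utf-8"))
            is_utf8 = True
            buf += chr(cp).encode("utf-8")

    # If nothing required UTF-8, the count was never consulted.
    return bytes(buf), is_utf8


print(build_string([0x41, 0xE9]))         # stays non-UTF-8
print(build_string([0x41, 0x100]))        # zero count: just set the flag
print(build_string([0x41, 0xE9, 0x100]))  # non-zero count: convert first
```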
The count was not being incremented when the input notation used \N{};
this commit fixes that. It was being incremented when the input
notation used \x{}, which is much more common in the field, so this bug
went unnoticed for a long time.
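To show why a missed increment matters, here is a hedged Python model
of the bug. Everything in it is hypothetical (the token format, the
count_named switch, the helper names), and the actual symptom in Perl
may differ; it only shows how a stale single-byte character can be
left in a buffer that is then flagged as UTF-8.

```python
def build_with_notation(tokens, count_named=True):
    """Hypothetical model. tokens is a list of (notation, code_point)
    pairs, notation being "x" for \\x{} or "N" for \\N{}.
    count_named=False imitates the pre-fix behavior, in which \\N{}
    input did not increment the variant count."""
    buf = bytearray()
    is_utf8 = False
    variant_count = 0
    for notation, cp in tokens:
        if is_utf8:
            buf += chr(cp).encode("utf-8")
        elif cp <= 0xFF:
            buf.append(cp)
            if cp >= 0x80 and (notation == "x" or count_named):
                variant_count += 1
        else:
            if variant_count:
                buf = bytearray(bytes(buf).decode("latin-1").encode("utf-8"))
            is_utf8 = True
            buf += chr(cp).encode("utf-8")
    return bytes(buf), is_utf8


def is_well_formed_utf8(data: bytes) -> bool:
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A Latin-1 character followed by an above-Latin-1 one, both via \N{}.
TOKENS = [("N", 0xE9), ("N", 0x100)]

fixed, _ = build_with_notation(TOKENS)
buggy, _ = build_with_notation(TOKENS, count_named=False)
print(is_well_formed_utf8(fixed))  # variant was counted, so converted
print(is_well_formed_utf8(buggy))  # raw 0xE9 byte left in UTF-8 buffer
```

With the same two characters written using the "x" notation, the
model counts the variant in both modes, which mirrors why the \x{}
form never exposed the problem.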
Fixes #21748
(Just for the record, on EBCDIC platforms more characters are UTF-8
invariant than on ASCII platforms; the macros called here hide that from
the code.)