  Branch: refs/heads/blead
  Home:   https://github.com/Perl/perl5
  Commit: fda991d326d6b70c7882251a4c756096d18f9dac
      https://github.com/Perl/perl5/commit/fda991d326d6b70c7882251a4c756096d18f9dac
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)
  Changed paths:
    M ext/XS-APItest/t/utf8_warn_base.pl

  Log Message:
  -----------
  APItest: Skip some utf8n_to_utf8_msgs tests

This function returns values in an AV instead of raising warnings. It turns out that this test file gets some of it wrong. And this test file turns out to be inadequate in other ways. I have rewritten the test file, but there isn't time to get it in before the code-complete deadline. Fixes here will end up being discarded.

In order to get the code that is actually part of the perl interpreter into this release, I've skipped the test that would fail here, and made sure it all passes the rewritten test.


  Commit: c2d3dc14ba2d42fcddcc2f115719f5e91dce8838
      https://github.com/Perl/perl5/commit/c2d3dc14ba2d42fcddcc2f115719f5e91dce8838
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M inline.h

  Log Message:
  -----------
  utf8_to_uvchr_buf: Remove assertion

This variable is no longer accessed directly by this function. Any assertion about it should come from the function this passes it to.


  Commit: 4c68b3773524a11538d9988c528fc50f75e5376e
      https://github.com/Perl/perl5/commit/4c68b3773524a11538d9988c528fc50f75e5376e
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Extract redundant code to common

This case: has two occurrences of the same statement, within two different conditionals. But the case: doesn't get executed unless at least one of those conditionals is known to be true. Therefore the statement is guaranteed to be executed at least once; no need to have two copies.
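The reasoning in the commit just above can be sketched as a before/after pair. This is only an illustration of the refactoring pattern, not the code in utf8.c: the flag names, the result bits, and both function names are hypothetical, and the precondition (at least one of the two flags is set) is an assumption standing in for "the case: doesn't get executed unless at least one of those conditionals is known to be true".

```c
#include <assert.h>

/* Hypothetical flags and result bits, invented for this sketch only */
enum { WANT_WARN = 1 << 0, WANT_DISALLOW = 1 << 1 };
enum { ERR_SEEN = 1 << 4, DID_WARN = 1 << 5, DID_REJECT = 1 << 6 };

/* Before: the same statement (setting ERR_SEEN) appears in both branches.
 * Precondition: (flags & (WANT_WARN | WANT_DISALLOW)) != 0 */
static unsigned handle_case_before(unsigned flags)
{
    unsigned result = 0;

    if (flags & WANT_WARN) {
        result |= ERR_SEEN;
        result |= DID_WARN;
    }
    if (flags & WANT_DISALLOW) {
        result |= ERR_SEEN;
        result |= DID_REJECT;
    }
    return result;
}

/* After: because at least one branch always runs, the shared statement
 * can be executed once, unconditionally. */
static unsigned handle_case_after(unsigned flags)
{
    unsigned result;

    assert(flags & (WANT_WARN | WANT_DISALLOW));   /* the precondition */

    result = ERR_SEEN;
    if (flags & WANT_WARN)     result |= DID_WARN;
    if (flags & WANT_DISALLOW) result |= DID_REJECT;
    return result;
}
```

The two functions agree for every input that satisfies the precondition; they differ only when neither flag is set, which is exactly the situation the commit message says cannot occur.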
  Commit: ff7238915cd406becedc725df021d41d6107c725
      https://github.com/Perl/perl5/commit/ff7238915cd406becedc725df021d41d6107c725
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Make code less brittle

Processing the overlong malformation needed to be last because it likely would overwrite the calculated UV. Other cases also overwrote that. This is unnecessarily brittle, as we can simply store the UV before processing any cases, and then refer to that copy.


  Commit: fa3575aa7c635692e71590ace9b9ab1a9472a082
      https://github.com/Perl/perl5/commit/fa3575aa7c635692e71590ace9b9ab1a9472a082
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Add safety assignment

This sets the accumulated code point to UV_MAX when overflow is detected. Much further below, the REPLACEMENT CHARACTER is returned instead; but this makes sure that code in between doesn't get confused by an intermediate value.


  Commit: 8d314759431def20744589a722028fad3936db06
      https://github.com/Perl/perl5/commit/8d314759431def20744589a722028fad3936db06
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Simplify checking for overlong

Prior to this commit, there were two different methods for doing this check; one if no malformations have been found so far, and the other if some had been found. The latter method is valid in both cases, and is just as fast or faster than the first method.
So change to always use it.


  Commit: 7c94d7394063bf8af825fa5e6e86f66d73530aff
      https://github.com/Perl/perl5/commit/7c94d7394063bf8af825fa5e6e86f66d73530aff
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c
    M utf8.h

  Log Message:
  -----------
  utf8_to_uv_msgs: Revise assert

More extensive testing revealed that more conditions than this assert previously contained are legitimate. This required defining the name for a flag.


  Commit: 71c5788cff7cf77f6418cd2f8da423a17c0d54ec
      https://github.com/Perl/perl5/commit/71c5788cff7cf77f6418cd2f8da423a17c0d54ec
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Swap order of switch() cases

The overlong cases more logically belong with the other conditions that are rejected by default. Future commits will simplify this to look much more like those other conditions.


  Commit: c4df0807ee70540989f027889680087ddc64fd41
      https://github.com/Perl/perl5/commit/c4df0807ee70540989f027889680087ddc64fd41
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Move conditional to earlier to avoid work

By checking before we go to the trouble to do something, rather than in the middle of it, we can save some work.
The new test looks at the source UTF-8; the previous one looked at the code point calculated from it.


  Commit: e0627d5bc831dc78112940aee98d27e2a729f82e
      https://github.com/Perl/perl5/commit/e0627d5bc831dc78112940aee98d27e2a729f82e
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: De-duplicate common code

This removes the duplicate code from many of the case statements in a switch to be common before the switch, with a single conditional controlling them.


  Commit: 3786151d3e0ef00885a777e77cb7b156ad86b463
      https://github.com/Perl/perl5/commit/3786151d3e0ef00885a777e77cb7b156ad86b463
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M ext/XS-APItest/t/utf8_warn_base.pl

  Log Message:
  -----------
  Skip testing utf8 translating for the next few commits

The next few commits will fail these tests. I could squash them all together, but that would hide the step-by-step change progress. This should allow future bisecting to not fail in this commit window.


  Commit: 6507d4a56a59e7a3008fb4f2228a50013c3e404c
      https://github.com/Perl/perl5/commit/6507d4a56a59e7a3008fb4f2228a50013c3e404c
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Reverse order of finding overflow/extended UTF-8

This begins the process of fixing the current problematic behavior of handling UTF-8 that is for code points above the Unicode maximum. The lowest of these are considered SUPERs, but if you go high enough, it takes Perl's extended UTF-8 to represent them. Higher still, and the extended UTF-8 can represent code points that don't fit in the current platform's word size.

A complication is overlongs, where the representation for a seemingly large code point can reduce down to something much smaller; even 0.
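To make the "reduce down to something much smaller; even 0" remark concrete, here is a minimal sketch of an overlong check. It is not the method used in utf8.c: it assumes an ASCII platform, handles only the standard 1- to 4-byte forms (Perl's extended UTF-8 is out of scope), and the function names `shortest_len` and `overlong_target` are made up for this illustration.

```c
#include <stddef.h>

/* Number of bytes the shortest standard UTF-8 encoding of cp needs */
static size_t shortest_len(unsigned long cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* If s[0..len) starts with an overlong sequence, return the code point it
 * reduces to; otherwise (not overlong, malformed, or out of scope) return -1 */
static long overlong_target(const unsigned char *s, size_t len)
{
    size_t need;
    unsigned long cp;

    if (len == 0) return -1;

    /* The start byte gives the sequence length and the top payload bits */
    if      (s[0] < 0x80)           { need = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else return -1;   /* continuation byte or longer form: not handled here */

    if (len < need) return -1;

    /* Each continuation byte contributes its low 6 bits */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Overlong means more bytes were used than the value requires */
    return need > shortest_len(cp) ? (long) cp : -1;
}
```

With this sketch, the two-byte sequence `\xC0\x80` and the four-byte sequence `\xF0\x80\x80\x80` both reduce to code point 0, while a shortest-form sequence such as `\xE2\x82\xAC` is reported as not overlong.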
Such sequences are considered invalid by fiat from Unicode due to successful hacker attacks using them. But Perl has traditionally allowed XS code to allow them, with flags passed to the translation functions. So it is important to get it right.

A sequence that overflows by necessity is using Perl's extended UTF-8, as that kicks in below a 32 bit word.

This commit reverses the prior order of testing for overflow and extended UTF-8. Steps can be saved because we now test for Perl-extended first, which is a lot more likely to happen than overflow.


  Commit: b1a21fc8531cf47ab0645788b5644916ed91620a
      https://github.com/Perl/perl5/commit/b1a21fc8531cf47ab0645788b5644916ed91620a
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Fix handling of too-short malformations

At this point in the code we know that the input sequence is shorter than a full character and that it is the legal beginning of a sequence that could evaluate to a code point that is of interest to the caller of this function. It turns out that in some cases any filling out of the input to a full character must lead to a code point that the caller is interested in. That interest has been signalled by flags passed to this function.

In the past, we filled out the sequence with the minimum legal continuation byte, but that is wrong for some cases. This commit fixes that. Certain start bytes require the second byte to be higher than the minimum, or else it is an overlong. Prior to this commit, we could generate overlongs. This commit avoids that pitfall. It also moves the complex analysis away from the comments in the code, and to this commit message, adding even more analysis.

There are four classes of code points that the caller can have signalled to this function that it is interested in.
The noncharacter code point class always needs a full sequence to determine, and the conditionals prevent the code this analysis is about from being executed. Use of Perl extended-UTF-8 is determinable from the first byte in the input sequence, and that has already been determined. The sequences for the other two classes don't have to be fully filled out in order to determine if a partial sequence would lead to them or not.

Consider first, the sequences that evaluate to an above-Unicode code point, charmingly named "supers" by Perl's poetic coders.

                 ASCII platforms       EBCDIC I8
    U+10FFFF:    \xF4\x8F\xBF\xBF      \xF9\xA1\xBF\xBF\xBF
    0x110000:    \xF4\x90\x80\x80      \xF9\xA2\xA0\xA0\xA0
    (Continuation byte range):
                 \x80 to \xbf          \xa0 to \xbf

On ASCII platforms, any start byte \xf3 and below can't be for a super, and any non-overlong sequence \xf5 and above has to be for a super. If the start byte is \xf4, we need a second byte to resolve the ambiguity. But it takes just the one, or possibly two bytes to make the determination.

It's similar on EBCDIC, but with different values.

And a similar situation exists for the surrogates. The range of non-overlong surrogates is:

                 ASCII platforms       EBCDIC I8
                 "\xed\xa0\x80"        "\xf1\xb6\xa0\xa0"
            to   "\xed\xbf\xbf"        "\xf1\xb7\xbf\xbf"

In both platforms, if we have the first two bytes, we can tell if it is a surrogate or not, as all legal continuations in the rest of the byte positions are for surrogates. If we have only one byte, we can't tell, so we have to assume it isn't a surrogate.

Overlongs don't meaningfully change things. The shortest ASCII overlong for the first surrogate is "\xf0\x8d\xa0\x80" and for the highest surrogate it is "\xf0\x8d\xbf\xbf". Note that only the first byte has been changed, into two bytes. All but the first byte is the same for any overlong of any code point in either ASCII or EBCDIC. This means the algorithm for filling things out works for these two classes in all cases.
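The ASCII-platform rules described above can be sketched as two small classifiers that look at no more than the first two bytes. This is an illustration under stated assumptions, not the code in utf8.c: it covers ASCII platforms only, it ignores overlong forms (which the message treats separately), and the type `partial_verdict` and both function names are hypothetical.

```c
#include <stddef.h>

typedef enum { PARTIAL_NO, PARTIAL_YES, PARTIAL_UNKNOWN } partial_verdict;

/* Must every non-overlong completion of s[0..len) be above U+10FFFF? */
static partial_verdict partial_is_super(const unsigned char *s, size_t len)
{
    if (len == 0) return PARTIAL_UNKNOWN;

    if (s[0] <= 0xF3) return PARTIAL_NO;    /* tops out below U+10FFFF */
    if (s[0] >= 0xF5) return PARTIAL_YES;   /* assuming it isn't overlong */

    /* Start byte is 0xF4: U+10FFFF is F4 8F BF BF, 0x110000 is F4 90 80 80,
     * so the second byte settles it */
    if (len < 2) return PARTIAL_UNKNOWN;
    return (s[1] >= 0x90 && s[1] <= 0xBF) ? PARTIAL_YES : PARTIAL_NO;
}

/* Must every completion of s[0..len) be a surrogate (U+D800..U+DFFF)? */
static partial_verdict partial_is_surrogate(const unsigned char *s, size_t len)
{
    if (len == 0) return PARTIAL_UNKNOWN;
    if (s[0] != 0xED) return PARTIAL_NO;    /* overlong forms not handled */

    /* Surrogates run from ED A0 80 to ED BF BF, so the second byte decides;
     * with one byte only, the message says to assume "not a surrogate" */
    if (len < 2) return PARTIAL_UNKNOWN;
    return (s[1] >= 0xA0 && s[1] <= 0xBF) ? PARTIAL_YES : PARTIAL_NO;
}
```

A caller following the message would treat `PARTIAL_UNKNOWN` the same as `PARTIAL_NO`; the three-valued result is used here only to make the "can't tell yet" cases visible.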
Note also that the upper end of the range conveniently works out without any extra effort needed. The highest surrogate corresponds to the highest continuation bytes. And the highest super that fits in the platform will also use the highest continuation bytes.

The start bytes that need to have the fix in this commit are the ones that could be the start of overlongs, minus the lower ones which can represent only code points smaller than any of the ones the caller can flag as being "interesting" (U+D800 is that value), and minus 0xFF. Hence 0xE0 can have overlongs, but it and its overlongs can only represent code points lower than 0xD800. So we don't have to worry about it or any smaller start byte.

But the reason 0xFF doesn't have to be considered is more complex. It isn't the second byte in a sequence beginning with FF that needs to be higher than the minimum continuation, but one further in. This would make things harder except that any sequence beginning with 0xFF is Perl-extended UTF-8, and has already been considered earlier in this function. This code is only executed when 'must_be_super' is false. 'must_be_super' is set true if the sequence overflows or there is no detectable overlong. By De Morgan's laws, this means to get here, it doesn't overflow, and must be overlong. To know that it is overlong, we must have seen enough bytes to get past the point where we need a higher continuation byte to legally fill it out. So we can just fill the rest with the minimum continuation.

(Note that the same reasoning would apply to 0xFE on ASCII platforms. That is also used only by Perl-extended UTF-8, so would have been considered earlier, and to get here we know it has to be overlong, and so we've already seen enough bytes to not need to handle it specially. But it fits into the same paradigm as the lower start bytes with just the second byte needing to be higher, and there is no extra code required to handle it besides including a case: for it in the switch().
This works in both ASCII and EBCDIC.)


  Commit: d1fba02797d2739f81e5a840744f4e675039a926
      https://github.com/Perl/perl5/commit/d1fba02797d2739f81e5a840744f4e675039a926
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: De-duplicate some more code

This moves a conditional found in all cases in a switch() to just before the switch, so the code is not duplicated.


  Commit: 78cced2399d45a6ef0f3fffa06e6b6aad434885e
      https://github.com/Perl/perl5/commit/78cced2399d45a6ef0f3fffa06e6b6aad434885e
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  Swap comment order

It is more easily understood reversed.


  Commit: 9540191231a7d4317417176944cc51a7a16a6ead
      https://github.com/Perl/perl5/commit/9540191231a7d4317417176944cc51a7a16a6ead
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Revise and rename macro

This macro is used to hide the details of determining if an abnormal condition should raise a warning or not. But I found it more convenient to expand the macro to return the packed warnings category(ies) if a warning should be raised. That information is known inside the macro and was being discarded, and then having to be recalculated. The new name reflects its expanded purpose, PACK_WARN.

0 is returned if no warnings need be raised; and, importantly fixing a bug in the old code, it returns < 0 if no warning should be raised directly, but an entry needs to be added to the AV array returned by the function (if the parameter requesting that has been passed in).

But Encode, for which this form of the translation function was created, and may be the only user of it, depends on not getting a zero return. So this has an override until Encode can be fixed.
I introduced the DIE_IF_MALFORMED flag in the previous development release, making it subservient to the CHECK_ONLY flag. I have since realized that the precedence should be reversed. If a developer inadvertently passes both flags, it is better to honor the one saying you need to quit, than the one saying ignore any problems.


  Commit: 3aca733b2ab5c0e9c293648f169ec3320e2473ce
      https://github.com/Perl/perl5/commit/3aca733b2ab5c0e9c293648f169ec3320e2473ce
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  perlapi: DIE_IF_MALFORMED overrides CHECK_ONLY

This documents the change in the previous commit


  Commit: 2a00b118018027a7154164b0fa646dab8ce52114
      https://github.com/Perl/perl5/commit/2a00b118018027a7154164b0fa646dab8ce52114
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Create a common macro

Previous commits have allowed the beginning of several of the case statements in this switch() to have the same code. This commit creates a macro encapsulating that code and changes the cases to use it.

The macro continues the enclosing loop if no message needs to be generated. This allows the removal of various conditional blocks. And it means that these conditions don't break to the bottom of the switch() if no message is needed.
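The control flow described above, a macro at the top of each case that continues the enclosing loop instead of breaking to the bottom of the switch(), can be sketched as follows. Everything here is hypothetical: the macro name, the problem bits, and the function are invented for illustration and are not the ones in utf8.c.

```c
/* Hypothetical problem bits, one per class of malformation */
enum { PROB_SURROGATE = 1, PROB_NONCHAR = 2, PROB_SUPER = 4 };

/* Deliberately a bare `if`, not do { } while (0): inside a do-while the
 * `continue` would bind to that do-while rather than the caller's loop.
 * The trailing `else` keeps the macro safe next to a caller's own `else`. */
#define NEXT_UNLESS_WANTED(wanted, bit) \
    if (! ((wanted) & (bit))) continue; else ((void) 0)

/* Count how many of the problems in `found` lead to a message, given the
 * set of problems the caller asked to hear about in `wanted` */
static unsigned messages_generated(unsigned found, unsigned wanted)
{
    unsigned generated = 0;

    for (unsigned bit = PROB_SURROGATE; bit <= PROB_SUPER; bit <<= 1) {
        if (! (found & bit)) continue;

        switch (bit) {
          case PROB_SURROGATE:
            NEXT_UNLESS_WANTED(wanted, bit);
            /* ... build the surrogate message here ... */
            break;
          case PROB_NONCHAR:
            NEXT_UNLESS_WANTED(wanted, bit);
            /* ... build the noncharacter message here ... */
            break;
          case PROB_SUPER:
            NEXT_UNLESS_WANTED(wanted, bit);
            /* ... build the above-Unicode message here ... */
            break;
        }

        /* Bottom of the switch: reached only when a message was wanted,
         * because the macro skipped straight to the next iteration otherwise */
        generated++;
    }

    return generated;
}
```

Because `continue` is not captured by `switch`, the shared code after the switch never has to re-test whether a message is needed, which is the simplification the message describes.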
Braces are needed in one case: so as to not run afoul of C++ initialization crossing.


  Commit: 51fbc1cf7bbf9c3dd6c4159811322dd1fec39bd5
      https://github.com/Perl/perl5/commit/51fbc1cf7bbf9c3dd6c4159811322dd1fec39bd5
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Convert switch case to use macro

By changing flags earlier in the function, we can convert this case in a switch to use the macro introduced in the previous commit


  Commit: 22d8ec6da1d8e799219e7f66fd962fa9ccb2d4af
      https://github.com/Perl/perl5/commit/22d8ec6da1d8e799219e7f66fd962fa9ccb2d4af
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Create another common macro

This new macro allows two more case statements in the switch to have a common macro at their beginnings, instead of having to repeat code.


  Commit: 238a42b9ab5c4303236baee35cb3e4c0bee96c1b
      https://github.com/Perl/perl5/commit/238a42b9ab5c4303236baee35cb3e4c0bee96c1b
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Revamp handling of above-Unicode code points

As stated in a recent commit message, this is complex and problematic. This commit revamps it, simplifying it and fixing the known remaining bugs.


  Commit: b48d541824a35068f1b154f90382b130581a8c22
      https://github.com/Perl/perl5/commit/b48d541824a35068f1b154f90382b130581a8c22
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M ext/XS-APItest/t/utf8_warn_base.pl

  Log Message:
  -----------
  Reinstate utf8 translation testing

The previous commit fixed the remaining problems that this test finds, and so it can be turned on again.
  Commit: 061644d72c121af5a398b469c7f96f5bd8af97fd
      https://github.com/Perl/perl5/commit/061644d72c121af5a398b469c7f96f5bd8af97fd
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8.c: Remove no longer used #define


  Commit: e6951798a38b8abe7d29fcf83a8ecd43f1579b18
      https://github.com/Perl/perl5/commit/e6951798a38b8abe7d29fcf83a8ecd43f1579b18
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Remove redundant conditionals

The comments added to the code in this commit explain that to get here, something needs to be done; no need to test again.


  Commit: 5104b9a382f597762915def215ec5632a7d1c549
      https://github.com/Perl/perl5/commit/5104b9a382f597762915def215ec5632a7d1c549
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Add, clarify comments


  Commit: 6b30aa30618e2ef4b0029caffdd2a51e645e4970
      https://github.com/Perl/perl5/commit/6b30aa30618e2ef4b0029caffdd2a51e645e4970
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Use already computed value

Instead of doing the subtraction again, use the variable that already contains the desired value.
  Commit: b0ac0a6283361db570723f33b6ea2d5e3233177e
      https://github.com/Perl/perl5/commit/b0ac0a6283361db570723f33b6ea2d5e3233177e
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8.c: White-space only

Outdent after removing enclosing braces


  Commit: cab4c628207ecc6c954766194e3dbc02ed196f69
      https://github.com/Perl/perl5/commit/cab4c628207ecc6c954766194e3dbc02ed196f69
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Assert against both returning and warning

This asserts against the flags to the call of this function being contradictory, in that it is both 1) to warn and/or die if anything goes wrong; and 2) not to warn under any circumstances but instead to return to the caller objects describing what it would have otherwise warned.

In a non-DEBUGGING build, the warn/die flags are ignored.


  Commit: a1805b9cc667e10264e904172273826bb0e46360
      https://github.com/Perl/perl5/commit/a1805b9cc667e10264e904172273826bb0e46360
  Author: Karl Williamson <k...@cpan.org>
  Date:   2025-03-17 (Mon, 17 Mar 2025)

  Changed paths:
    M ext/XS-APItest/t/utf8_warn_base.pl
    M inline.h
    M utf8.c
    M utf8.h

  Log Message:
  -----------
  Merge branch 'Fix utf8 corner cases' into blead

There are around 20 different functions that take a UTF-8 sequence of bytes and try to find the ordinal code point represented by them.

It was becoming clear that the existing tests in our suite were inadequate, not finding glaring bugs. And UTF-8 handling is important, with failures in it having been exploited by hackers in various products over the years for various nefarious purposes.

I set out to improve the tests, spending way too much time before realizing that adding band aids to the current scheme was not going to work out. So I undertook rewriting the tests. This turned out to be way harder and more time-consuming than I expected.
And it still isn't ready to go into blead.

But along the way, I discovered that it was finding corner case bugs that I would never have anticipated. This series of commits fixes those, while simplifying the code and reducing redundancy.

The new test file needs clean-up, and probably ways to make it faster, but it is finally far enough along that I believe it has caught most of the bugs out there. So I'm submitting these now to get into v5.42. The deadline for the test file is later in the development process.


Compare: https://github.com/Perl/perl5/compare/6a4f62c87346...a1805b9cc667