Branch: refs/heads/blead
  Home:   https://github.com/Perl/perl5
  Commit: 41cd62eeb2dbacd4956c1ec09d8b8efa7d21aaa5
      
https://github.com/Perl/perl5/commit/41cd62eeb2dbacd4956c1ec09d8b8efa7d21aaa5
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M regcomp.c

  Log Message:
  -----------
  regcomp.c: Move assertions; white space, braces only


  Commit: e2ca987cf3a1ff11e63c16aad8549bbeac81905c
      
https://github.com/Perl/perl5/commit/e2ca987cf3a1ff11e63c16aad8549bbeac81905c
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M charclass_invlists.inc
    M embed.h
    M lib/unicore/mktables
    M lib/unicore/uni_keywords.pl
    M regcharclass.h
    M regen/regcharclass.pl
    M regexp_constants.h
    M uni_keywords.h

  Log Message:
  -----------
  Add Unicode property for forbidden name characters

Somewhat fewer than 200 characters in Unicode are in \w but not
IDContinue, or vice versa.  This is surprising, but it stems from both
decisions made by Perl and those made by Unicode.

Perl has long forbidden any character from being a continuation unless it
is also a \w.  This is because of long-standing assumptions that modules
have made, which we didn't want to disrupt.

Unicode has 150-ish \w characters that it doesn't consider continuations.
The reasons are spelled out in
https://www.unicode.org/reports/tr31
particularly section 5.

Essentially, things go better if you can round-trip normalize
identifiers with all possible standard normalizations.

This, for example, excludes word characters that have enclosing shapes
associated with them, as CIRCLED LATIN CAPITAL LETTER M does.  When you
normalize it using NFKC, the CIRCLED goes away, and there is no way to
get it back.
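
The loss described above can be reproduced outside Perl.  The following
is only an illustrative sketch using Python's standard unicodedata
module (not anything in the Perl source); the variable names are made up
for the example.

```python
import unicodedata

# CIRCLED LATIN CAPITAL LETTER M, one of the characters in the range 24B6..24CF
circled_m = "\u24c2"

# NFKC applies the compatibility decomposition, which drops the <circle> tag
folded = unicodedata.normalize("NFKC", circled_m)

# NFC only applies canonical mappings, so the character survives unchanged
unchanged = unicodedata.normalize("NFC", circled_m)

print(unicodedata.name(circled_m), "->", repr(folded))
print("survives NFC:", unchanged == circled_m)
```

Once the identifier has been folded to a plain M, nothing in the result
records that it was ever circled, which is why such characters are
excluded from identifiers.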

Currently, if someone uses one of these few characters in an
identifier, Perl gives an obscure error message.  By making them
discoverable with a macro generated by regcharclass.pl, we can look
explicitly for them and deliver a meaningful message; that's done in the
next commits.

For reference, here is a complete list of the affected characters that
this property matches:
037A GREEK YPOGEGRAMMENI
0488 COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
0489 COMBINING CYRILLIC MILLIONS SIGN
1ABE COMBINING PARENTHESES OVERLAY
20DD COMBINING ENCLOSING CIRCLE
20DE COMBINING ENCLOSING SQUARE
20DF COMBINING ENCLOSING DIAMOND
20E0 COMBINING ENCLOSING CIRCLE BACKSLASH
20E2 COMBINING ENCLOSING SCREEN
20E3 COMBINING ENCLOSING KEYCAP
20E4 COMBINING ENCLOSING UPWARD POINTING TRIANGLE
24B6 CIRCLED LATIN CAPITAL LETTER A - Z
24D0 CIRCLED LATIN SMALL LETTER A - Z
2E2F VERTICAL TILDE
A670 COMBINING CYRILLIC TEN MILLIONS SIGN
A671 COMBINING CYRILLIC HUNDRED MILLIONS SIGN
A672 COMBINING CYRILLIC THOUSAND MILLIONS SIGN
FC5E ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
FC5F ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
FC60 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
FC61 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
FC62 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
FC63 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB ARABIC LIGATURE JALLAJALALOUHOU
FE70 ARABIC FATHATAN ISOLATED FORM
FE72 ARABIC DAMMATAN ISOLATED FORM
FE74 ARABIC KASRATAN ISOLATED FORM
FE76 ARABIC FATHA ISOLATED FORM
FE78 ARABIC DAMMA ISOLATED FORM
FE7A ARABIC KASRA ISOLATED FORM
FE7C ARABIC SHADDA ISOLATED FORM
FE7E ARABIC SUKUN ISOLATED FORM
1F130 SQUARED LATIN CAPITAL LETTER A - Z
1F150 NEGATIVE CIRCLED LATIN CAPITAL LETTER A - Z
1F170 NEGATIVE SQUARED LATIN CAPITAL LETTER A - Z

And these are the characters that are legal Unicode identifiers, but not
Perl ones.  This list has not changed for a long time.
00B7    MIDDLE DOT
0387    GREEK ANO TELEIA
1369    ETHIOPIC DIGIT 1 - 9
19DA    NEW TAI LUE THAM DIGIT ONE
2118    WEIERSTRASS ELLIPTIC FUNCTION
212E    ESTIMATED SYMBOL
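
As a rough cross-check of the two lists above, one can ask another
implementation of the Unicode identifier rules.  The sketch below uses
Python's str.isidentifier(), which is assumed here to follow
XID_Start/XID_Continue; it is an independent implementation, not Perl's
check, and its \w differs from Perl's, so it only illustrates the
Unicode side.  The helper name and the sample selections are
hypothetical.

```python
# A few characters from the first list: \w in Perl, but not identifier
# continuations according to Unicode.
forbidden_sample = [
    "\u24c2",      # CIRCLED LATIN CAPITAL LETTER M
    "\u20dd",      # COMBINING ENCLOSING CIRCLE
    "\U0001F130",  # SQUARED LATIN CAPITAL LETTER A
]

# Characters from the second list: legal in Unicode identifiers, not in Perl.
unicode_only_sample = [
    "\u00b7",  # MIDDLE DOT
    "\u0387",  # GREEK ANO TELEIA
    "\u1369",  # ETHIOPIC DIGIT ONE
    "\u19da",  # NEW TAI LUE THAM DIGIT ONE
    "\u2118",  # WEIERSTRASS ELLIPTIC FUNCTION (SCRIPT CAPITAL P)
    "\u212e",  # ESTIMATED SYMBOL
]


def continues_identifier(ch: str) -> bool:
    """Return True if ch may follow an ordinary start character (hypothetical helper)."""
    return ("a" + ch).isidentifier()


forbidden_results = [continues_identifier(ch) for ch in forbidden_sample]
unicode_only_results = [continues_identifier(ch) for ch in unicode_only_sample]

print("first list accepted: ", forbidden_results)
print("second list accepted:", unicode_only_results)
```

The samples from the first list are rejected and those from the second
are accepted, matching the split described in this commit message.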


  Commit: 978da58042b67e0e0af411e359214fc244059ed4
      
https://github.com/Perl/perl5/commit/978da58042b67e0e0af411e359214fc244059ed4
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M toke.c

  Log Message:
  -----------
  parse_ident: Move all error handling to function end

This is in preparation for a few commits from now.


  Commit: 5278ec8d1266eeb2160fee1b826818c56cf4c055
      
https://github.com/Perl/perl5/commit/5278ec8d1266eeb2160fee1b826818c56cf4c055
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M t/uni/parser.t

  Log Message:
  -----------
  uni/parser.t: Put test in a block


  Commit: 76a47333341cf13ea6eaffb35f97a5d3c59f6893
      
https://github.com/Perl/perl5/commit/76a47333341cf13ea6eaffb35f97a5d3c59f6893
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M t/uni/parser.t

  Log Message:
  -----------
  uni/parser.t: Add TODO test

This will be fixed by the next commit


  Commit: 995502e182b573b3d9ccaa7a1cb31c88d8593ac4
      
https://github.com/Perl/perl5/commit/995502e182b573b3d9ccaa7a1cb31c88d8593ac4
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M pod/perldelta.pod
    M pod/perldiag.pod
    M t/uni/parser.t
    M toke.c

  Log Message:
  -----------
  Give better error message for invalid \w in names

This commit uses the new Unicode property introduced a few commits prior
to check during parsing that identifiers do not contain one of the \w
characters that are not legal in names.


  Commit: dfb93b5326a9f13598520ef649a25840f0cbadf0
      
https://github.com/Perl/perl5/commit/dfb93b5326a9f13598520ef649a25840f0cbadf0
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M toke.c

  Log Message:
  -----------
  S_scan_inputsymbol: Use parse_ident

instead of rolling our own crippled version which lacks the checking.


  Commit: b096a45fb4dfa34c598d485b2865209f8ad5dabf
      
https://github.com/Perl/perl5/commit/b096a45fb4dfa34c598d485b2865209f8ad5dabf
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M toke.c

  Log Message:
  -----------
  S_checkcomma: Use parse_ident

instead of rolling our own crippled version which lacks the checking.


  Commit: 468809b54950e6a0016e1e1c9256d20b0c5bf303
      
https://github.com/Perl/perl5/commit/468809b54950e6a0016e1e1c9256d20b0c5bf303
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M embed.fnc
    M embed.h
    M parser.h
    M proto.h
    M toke.c

  Log Message:
  -----------
  toke.c: Add parse_ident_msg()

This new function can be used to have parse_ident() return an error
message to its caller instead of dying.  It turns out that regcomp.c
is in want of this functionality.


  Commit: a43dbe940cfed3e925e938554ce4d1d2061c03ab
      
https://github.com/Perl/perl5/commit/a43dbe940cfed3e925e938554ce4d1d2061c03ab
  Author: Karl Williamson <[email protected]>
  Date:   2026-01-29 (Thu, 29 Jan 2026)

  Changed paths:
    M embed.fnc
    M embed.h
    M pod/perldelta.pod
    M pod/perldiag.pod
    M pod/perlre.pod
    M proto.h
    M regcomp.c
    M regen/embed.pl
    M t/re/pat.t

  Log Message:
  -----------
  regcomp: Capture group names need be legal Unicode names

Previous commits have explicitly made sure that Perl identifiers are
legal Unicode names.  This extends that to regular expression group
(such as capturing) names.


Compare: https://github.com/Perl/perl5/compare/8d1ff091df83...a43dbe940cfe

