[Bug c++/106648] [C++23] P2071 - Named universal character escapes

2022-08-26 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106648

Jakub Jelinek  changed:

       What|Removed                      |Added

 Resolution|---                          |FIXED
     Status|ASSIGNED                     |RESOLVED

--- Comment #5 from Jakub Jelinek  ---
Implemented for GCC 13.

[Bug c++/106648] [C++23] P2071 - Named universal character escapes

2022-08-26 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106648

--- Comment #4 from CVS Commits  ---
The master branch has been updated by Jakub Jelinek :

https://gcc.gnu.org/g:eb4879ab9053085a59b8d1594ef76487948bba7e

commit r13-2212-geb4879ab9053085a59b8d1594ef76487948bba7e
Author: Jakub Jelinek 
Date:   Fri Aug 26 09:24:56 2022 +0200

c++: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

The following patch implements the
C++23 P2071R2 - Named universal character escapes
paper to support \N{LATIN SMALL LETTER E} etc.
I've used Unicode 14.0.  There are 144803 character name properties
(including the ones generated by the Unicode NR1 and NR2 rules)
and correction/control/alternate aliases; stored together with zero
terminators they would take 3884745 bytes, which is clearly unacceptable
for libcpp.
This patch instead contains a generator which reads the UnicodeData.txt
and NameAliases.txt files and emits a space optimized radix tree (208765
bytes long for 14.0), a single string literal dictionary (59418 bytes),
the maximum name length (currently 88 chars) and two small helper arrays
for the NR1/NR2 name generation.
The radix tree needs 2 to 9 bytes per node; the exact format is
described in the generator program.  There could be ways to shrink
the dictionary size somewhat at the expense of slightly slower lookups.
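
For readers unfamiliar with the NR1/NR2 rules mentioned above, here is a
minimal sketch of how such names are derived from a code point.  It follows
the Hangul syllable name algorithm and the "prefix plus hex code point" rule
from the Unicode Standard; the function and array names (hangul_syllable_name,
nr2_name, kLead, kVowel, kTrail) are made up for illustration and are not the
helper arrays or code in the patch, which maps in the opposite direction
(name to code point).

```cpp
#include <cstdio>
#include <string>

// Jamo short names as listed in the Unicode Standard (Jamo.txt).
static const char *const kLead[19] = {
  "G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
  "SS", "", "J", "JJ", "C", "K", "T", "P", "H"};
static const char *const kVowel[21] = {
  "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
  "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"};
static const char *const kTrail[28] = {
  "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB", "LS", "LT",
  "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C", "K", "T", "P", "H"};

// NR1: name of a precomposed Hangul syllable (U+AC00..U+D7A3).
// Returns an empty string for code points outside that range.
std::string hangul_syllable_name(char32_t cp) {
  const unsigned s_base = 0xAC00, v_count = 21, t_count = 28, s_count = 11172;
  if (cp < s_base || cp >= s_base + s_count)
    return "";
  unsigned idx = cp - s_base;
  unsigned l = idx / (v_count * t_count);
  unsigned v = (idx % (v_count * t_count)) / t_count;
  unsigned t = idx % t_count;
  return std::string("HANGUL SYLLABLE ") + kLead[l] + kVowel[v] + kTrail[t];
}

// NR2: a fixed prefix followed by the code point in uppercase hex,
// using at least four digits.  The caller supplies the prefix.
std::string nr2_name(const std::string &prefix, char32_t cp) {
  char buf[16];
  std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(cp));
  return prefix + buf;
}
```

For example, hangul_syllable_name (0xD4DB) yields "HANGUL SYLLABLE PWILH" and
nr2_name ("CJK UNIFIED IDEOGRAPH-", 0x4E00) yields "CJK UNIFIED IDEOGRAPH-4E00".
Because these names can be computed, they do not have to be stored one by one.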

Currently the patch implements strict matching (which is what is needed
to accept valid code) and the Unicode UAX44-LM2 loose matching algorithm,
which is used only to provide hints.  That algorithm essentially ignores
spaces, underscores, and any hyphen in between two alphanumeric characters
(with one exception for hyphen) and does case insensitive matching.
The attachment contains a WIP patch that shows how to also implement
spellcheck.{h,cc} style discovery of misspellings, but I'll need to talk
to David Malcolm about it, as spellcheck.{h,cc} is in the gcc/ subdir
(so the WIP incremental patch instead prints all the names to stderr).
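
To make the loose matching rule concrete, here is a simplified sketch of a
UAX44-LM2 style comparison key.  It is an illustration only, not the
_cpp_uname2c_uax44_lm2 code from the patch: the names loose_key and
loosely_equal are hypothetical, and the special case for the hyphen in
U+1180 HANGUL JUNGSEONG O-E is deliberately left out.

```cpp
#include <cctype>
#include <string>

// Build a comparison key: drop spaces and underscores, drop hyphens that sit
// directly between two alphanumeric characters, and fold case to uppercase.
// Assumption: the HANGUL JUNGSEONG O-E exception is not handled here.
std::string loose_key(const std::string &name) {
  std::string key;
  for (std::size_t i = 0; i < name.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(name[i]);
    if (c == ' ' || c == '_')
      continue;
    if (c == '-' && i > 0 && i + 1 < name.size()
        && std::isalnum(static_cast<unsigned char>(name[i - 1]))
        && std::isalnum(static_cast<unsigned char>(name[i + 1])))
      continue;  // medial hyphen, ignored
    key.push_back(static_cast<char>(std::toupper(c)));
  }
  return key;
}

// Two names match loosely when their keys are identical.
bool loosely_equal(const std::string &a, const std::string &b) {
  return loose_key(a) == loose_key(b);
}
```

With this, "Latin_Small-Letter E" matches "LATIN SMALL LETTER E" loosely, while
the hyphen in "TIBETAN MARK TSA -PHRU" is kept because it follows a space.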

2022-08-26  Jakub Jelinek  

PR c++/106648
libcpp/
* charset.cc: Implement C++23 P2071R2 - Named universal character
escapes.  Include uname2c.h.
(hangul_syllables, hangul_count): New variables.
(struct uname2c_data): New type.
(_cpp_uname2c, _cpp_uname2c_uax44_lm2): New functions.
(_cpp_valid_ucn): Use them.  Handle named universal character
escapes.
(convert_ucn): Adjust comment.
(convert_escape): Call convert_ucn even for \N.
(_cpp_interpret_identifier): Handle named universal character
escapes.
* lex.cc (get_bidi_ucn): Fix up function comment formatting.
(get_bidi_named): New function.
(forms_identifier_p, lex_string): Handle named universal character
escapes.
* makeuname2c.cc: New file.  Small parts copied from makeucnid.cc.
* uname2c.h: New generated file.
gcc/c-family/
* c-cppbuiltin.cc (c_cpp_builtins): Predefine
__cpp_named_character_escapes to 202207L.
gcc/testsuite/
* c-c++-common/cpp/named-universal-char-escape-1.c: New test.
* c-c++-common/cpp/named-universal-char-escape-2.c: New test.
* c-c++-common/cpp/named-universal-char-escape-3.c: New test.
* c-c++-common/cpp/named-universal-char-escape-4.c: New test.
* c-c++-common/Wbidi-chars-25.c: New test.
* gcc.dg/cpp/named-universal-char-escape-1.c: New test.
* gcc.dg/cpp/named-universal-char-escape-2.c: New test.
* g++.dg/cpp/named-universal-char-escape-1.C: New test.
* g++.dg/cpp/named-universal-char-escape-2.C: New test.
* g++.dg/cpp23/feat-cxx2b.C: Test __cpp_named_character_escapes.
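
As an illustration of the feature-test macro listed in the ChangeLog above,
user code could guard its use of named escapes roughly as follows.  This is a
sketch only; the function name e_acute is hypothetical, and the fallback
branch uses an ordinary \u escape for the same character.

```cpp
#include <string>

// Returns U+00E9 (LATIN SMALL LETTER E WITH ACUTE) in the execution charset.
std::string e_acute() {
#if defined(__cpp_named_character_escapes) \
    && __cpp_named_character_escapes >= 202207L
  // Named universal character escape (C++23, P2071R2).
  return "\N{LATIN SMALL LETTER E WITH ACUTE}";
#else
  // Compilers without the feature: spell the same character numerically.
  return "\u00E9";
#endif
}
```

Both branches are meant to produce the same string, so callers do not need to
know which spelling the compiler accepted.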

[Bug c++/106648] [C++23] P2071 - Named universal character escapes

2022-08-20 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106648

Jakub Jelinek  changed:

               What|Removed                      |Added

  Attachment #53478|0                            |1
        is obsolete|                             |
   Last reconfirmed|                             |2022-08-20
             Status|UNCONFIRMED                  |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org|jakub at gcc dot gnu.org
     Ever confirmed|0                            |1

--- Comment #3 from Jakub Jelinek  ---
Created attachment 53483
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53483&action=edit
gcc13-pr106648.patch.xz

So far just lightly tested patch.

This handles "did you mean" hints using the Unicode UAX44-LM2 algorithm, but
doesn't offer fixits for it (not sure if that is possible in libcpp) and doesn't
use the spellcheck* stuff for fallback suggestions.  The number of strings and
their sizes are too large to push them all into a vector, but just walking all
radix tree nodes, computing the current name as we go, and computing the
Damerau-Levenshtein distance at each codepoint (including generated ones) could
work.  But spellcheck.{cc,h} are in gcc/ ...
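
For reference, a minimal sketch of the kind of distance computation mentioned
above.  This is the restricted (optimal string alignment) variant of
Damerau-Levenshtein, written from the textbook recurrence; it is not the code
in spellcheck.cc, and the name osa_distance is hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Optimal string alignment distance: insertions, deletions, substitutions
// and transpositions of two adjacent characters each cost 1.
std::size_t osa_distance(const std::string &a, const std::string &b) {
  const std::size_t n = a.size(), m = b.size();
  std::vector<std::vector<std::size_t>> d(n + 1,
                                          std::vector<std::size_t>(m + 1));
  for (std::size_t i = 0; i <= n; ++i)
    d[i][0] = i;
  for (std::size_t j = 0; j <= m; ++j)
    d[0][j] = j;
  for (std::size_t i = 1; i <= n; ++i)
    for (std::size_t j = 1; j <= m; ++j) {
      std::size_t cost = a[i - 1] == b[j - 1] ? 0 : 1;
      d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                          d[i][j - 1] + 1,          // insertion
                          d[i - 1][j - 1] + cost}); // substitution or match
      if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
        d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1);  // transposition
    }
  return d[n][m];
}
```

A candidate name would be suggested when its distance to the misspelled name
is small relative to the name length; the threshold is a policy choice and is
not shown here.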

[Bug c++/106648] [C++23] P2071 - Named universal character escapes

2022-08-19 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106648

Jakub Jelinek  changed:

               What|Removed                      |Added

  Attachment #53471|0                            |1
        is obsolete|                             |

--- Comment #2 from Jakub Jelinek  ---
Created attachment 53478
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53478&action=edit
gcc13-pr106648-wip.patch.xz

Updated WIP patch.  This fixes bugs in the generator and adds a routine that
implements the mapping (tested so far on names from glibc UTF-8 and the
https://eel.is/c++draft/lex.charset table), but it isn't actually wired up
for \N{name} yet.
For now it does just exact matching; for fixit hints we'll need to do something
slightly different.

[Bug c++/106648] [C++23] P2071 - Named universal character escapes

2022-08-18 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106648

Jakub Jelinek  changed:

       What|Removed                      |Added

         CC|                             |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek  ---
Created attachment 53471
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53471&action=edit
makeuname2c.cc

So far I've written a generator of a space optimized radix tree for the Unicode
name to codepoint mapping (this would be libcpp/makeuname2c.cc), but I still
need to write a consumer of those arrays to actually implement the lookup.
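
To illustrate the data structure, here is a small pointer-based radix tree
that maps names to code points.  It only shows the idea of sharing common
prefixes along labelled edges; it is not the packed byte format emitted by
makeuname2c.cc, and all names in it (Node, insert, lookup, kNoCodePoint) are
hypothetical.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

constexpr char32_t kNoCodePoint = 0xFFFFFFFF;  // sentinel: no name ends here

struct Node {
  char32_t code_point = kNoCodePoint;
  // Outgoing edges keyed by their first character; each edge carries a
  // multi-character label and the child node it leads to.
  std::map<char, std::pair<std::string, std::unique_ptr<Node>>> edges;
};

void insert(Node &root, const std::string &name, char32_t cp) {
  Node *node = &root;
  std::size_t pos = 0;
  while (pos < name.size()) {
    auto it = node->edges.find(name[pos]);
    if (it == node->edges.end()) {
      // No edge shares a prefix with the rest of the name: add a leaf.
      auto leaf = std::make_unique<Node>();
      leaf->code_point = cp;
      node->edges.emplace(name[pos],
                          std::make_pair(name.substr(pos), std::move(leaf)));
      return;
    }
    std::string &label = it->second.first;
    std::size_t k = 0;
    while (k < label.size() && pos + k < name.size()
           && label[k] == name[pos + k])
      ++k;
    if (k < label.size()) {
      // The name diverges inside the label: split the edge at offset k.
      auto mid = std::make_unique<Node>();
      mid->edges.emplace(label[k], std::make_pair(label.substr(k),
                                                  std::move(it->second.second)));
      label.resize(k);
      it->second.second = std::move(mid);
    }
    node = it->second.second.get();
    pos += k;
  }
  node->code_point = cp;
}

// Strict lookup: returns kNoCodePoint unless the whole name matches exactly.
char32_t lookup(const Node &root, const std::string &name) {
  const Node *node = &root;
  std::size_t pos = 0;
  while (pos < name.size()) {
    auto it = node->edges.find(name[pos]);
    if (it == node->edges.end())
      return kNoCodePoint;
    const std::string &label = it->second.first;
    if (name.compare(pos, label.size(), label) != 0)
      return kNoCodePoint;
    pos += label.size();
    node = it->second.second.get();
  }
  return node->code_point;
}
```

After inserting "LATIN SMALL LETTER E" and "LATIN SMALL LETTER A", the shared
prefix "LATIN SMALL LETTER " is stored once on a single edge, which is where
the space saving over a flat list of names comes from.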