On Mon, 8 Jan 2024 at 01:19, Jonathan Wakely <[email protected]> wrote:
>
> I decided to push this now, not wait for the morning.
>
> This is mostly the same as V2, but adds to the contrib/unicode/README as
> suggested by Lewis, and avoids a trailing whitespace character in the
> generated header.
>
> Tested x86_64-linux and aarch64-linux. Pushed to trunk.
>
> -- >8 --
>
>
> This implements the requirements in the following proposals, which
> dictate how std::format deals with non-ASCII strings:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf
>
> There are two parts to this. The width estimation for strings must only
> count the width of the first character in an extended grapheme cluster.
> That requires implementing the algorithm for detecting cluster breaks,
> which requires a number of lookup tables of the grapheme cluster break
> properties (and Indic_Conjunct_Break and Extended_Pictographic
> properties) of every code point. Additionally, some characters have a
> field width of 2, which requires another lookup table of field widths
> for every code point. The tables added in this commit do not contain
> entries for every code point from 0 to 0x10FFFF as that would be very
> inefficient and use too much memory. Instead the tables only contain the
> code points that form an "edge" for a property, omitting all the code
> points that have the same property as the preceding one. We can use a
> binary search to find the closest code point in the table that is not
> greater than the one we're looking for.
>
> The tables are generated by a new Python script added to the
> contrib/unicode directory, and a new data file downloaded from the
> Unicode Consortium website.
>
> The rules for extended grapheme cluster breaking are implemented for the
> latest Unicode standard, version 15.1.0.
>
> libstdc++-v3/ChangeLog:
>
> * include/Makefile.am: Add new headers.
> * include/Makefile.in: Regenerate.
> * include/bits/unicode.h: New file.
> * include/bits/unicode-data.h: New file.
> * include/std/format: Include <bits/unicode.h>.
> (__literal_encoding_is_utf8): Move to <bits/unicode.h>.
> (_Spec::_M_fill): Change type to char32_t.
> (_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
> instead of a single character.
> (__write_padded): Change __fill_char parameter to char32_t and
> encode it into the output.
> (__formatter_str::format): Use new __unicode::__field_width and
> __unicode::__truncate functions.
> * include/std/ostream: Adjust namespace qualification for
> __literal_encoding_is_utf8.
> * include/std/print: Likewise.
> * src/c++23/print.cc: Add [[unlikely]] attribute to error path.
> * testsuite/ext/unicode/view.cc: New test.
> * testsuite/std/format/functions/format.cc: Add missing examples
> from the standard demonstrating alignment with non-ASCII
> characters. Add examples checking correct handling of extended
> grapheme clusters.
>
> contrib/ChangeLog:
>
> * unicode/README: Add notes about generating libstdc++ tables.
> * unicode/GraphemeBreakProperty.txt: New file.
> * unicode/emoji-data.txt: New file.
> * unicode/gen_libstdcxx_unicode_data.py: New file.
> ---
While writing some more tests I realised I'd forgotten to finish this
function, and had left it as a copy and paste from
__field_width(char32_t) above. It returns 1 or 2, so as a bool it is
always true:
> +  constexpr bool
> +  __is_extended_pictographic(char32_t __c)
> +  {
> +    if (__c < __xpicto_edges[0]) [[likely]]
> +      return 1;
> +
> +    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
> +    return (__p - __xpicto_edges) % 2 + 1;
> +  }
It should be:
  constexpr bool
  __is_extended_pictographic(char32_t __c)
  {
    if (__c < __xpicto_edges[0]) [[likely]]
      return false;

    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
    return (__p - __xpicto_edges) % 2;
  }
I'll push a fix for that (and add my new tests) tomorrow.