On Fri, 2025-09-12 at 16:45 +0100, Peter Damianov wrote:
> UTF-8 characters in diagnostic output (such as the warning emoji ⚠️
> used by fanalyzer) display as mojibake on Windows unless the utf8
> code page is being used
>
> This patch adds UTF-8 to UTF-16 conversion when outputting to a console
> on Windows.
>
> gcc/ChangeLog:
> 	* pretty-print.cc (decode_utf8_char): Move forward declaration.
> 	(utf8_to_utf16): New function to convert UTF-8 to UTF-16.
> 	(is_console_handle): New function to detect Windows console handles.
> 	(write_all): Add UTF-8 to UTF-16 conversion for console output,
> 	falling back to WriteFile for ASCII strings and regular files.
>
> Signed-off-by: Peter Damianov <peter0...@disroot.org>
> ---
> v2:
> Fix linux build by moving decode_utf8_char outside of ifdef
> Keep form feed
>
>  gcc/pretty-print.cc | 132 +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 129 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/pretty-print.cc b/gcc/pretty-print.cc
> index d79a8282cfb..c29e15a41f3 100644
> --- a/gcc/pretty-print.cc
> +++ b/gcc/pretty-print.cc
> @@ -38,11 +38,18 @@ along with GCC; see the file COPYING3.  If not see
>  #include <iconv.h>
>  #endif
>
> +static int
> +decode_utf8_char (const unsigned char *, size_t len, unsigned int *);
> +
>  #ifdef __MINGW32__
>
>  /* Replacement for fputs() that handles ANSI escape codes on Windows NT.
>     Contributed by: Liu Hao (lh_mouse at 126 dot com)
>
> +   Extended by: Peter Damianov
> +   Converts UTF-8 to UTF-16 if outputting to a console, so that emojis and
> +   various other unicode characters don't get mojibak'd.
> +
>     XXX: This file is compiled into libcommon.a that will be self-contained.
>          It looks like that these functions can be put nowhere else.  */
>
> @@ -50,11 +57,132 @@ along with GCC; see the file COPYING3.  If not see
>  #define WIN32_LEAN_AND_MEAN 1
>  #include <windows.h>
>
> +/* Convert UTF-8 string to UTF-16.
> +   Returns true if conversion was performed, false if string is pure ASCII.
> +
> +   If the string contains only ASCII characters, returns false
> +   without allocating any memory.  Otherwise, a buffer that the caller
> +   must free is allocated and the string is converted into it.  */
> +static bool
> +utf8_to_utf16 (const char *utf8_str, size_t utf8_len, wchar_t **utf16_str,
> +	       size_t *utf16_len)

Thanks for the patch.

I notice that libcpp/charset.cc defines a function convert_utf8_utf16
(albeit currently static).  Is there a way that this could be reused,
rather than adding a 2nd implementation?

[...snip...]

Sorry, I confess I don't know enough about Windows compat to comment on
the rest of the patch.  If it fixes things on Windows and doesn't break
other OSes, that's good, I suppose :/

Hope this is constructive
Dave
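
For anyone following the thread, here is a minimal, portable sketch of the
contract described in the quoted comment: return false without allocating
for pure-ASCII input, otherwise convert to UTF-16.  It is neither the
patch's code nor libcpp's convert_utf8_utf16.  The names decode_one and
utf8_to_utf16_sketch, the use of std::u16string in place of a
caller-freed wchar_t buffer, and the choice to replace malformed bytes
with U+FFFD are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the patch): decode one UTF-8 sequence of
// at most LEN bytes starting at P.  On success stores the code point in
// *OUT and returns the number of bytes consumed; returns 0 if malformed.
static std::size_t
decode_one (const unsigned char *p, std::size_t len, char32_t *out)
{
  if (len == 0)
    return 0;
  unsigned char b0 = p[0];
  if (b0 < 0x80)
    {
      *out = b0;
      return 1;
    }
  std::size_t n;
  char32_t cp;
  if ((b0 & 0xE0) == 0xC0)      { n = 2; cp = b0 & 0x1F; }
  else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; }
  else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; }
  else
    return 0;                   // stray continuation or invalid lead byte
  if (len < n)
    return 0;                   // truncated sequence
  for (std::size_t i = 1; i < n; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return 0;               // not a continuation byte
      cp = (cp << 6) | (p[i] & 0x3F);
    }
  // Reject overlong encodings, surrogates and out-of-range values.
  static const char32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
  if (cp < min_cp[n] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  *out = cp;
  return n;
}

// Sketch of the described contract.  Returns false and leaves *OUT
// untouched if the input is pure ASCII; otherwise fills *OUT with the
// UTF-16 encoding and returns true.
static bool
utf8_to_utf16_sketch (const char *s, std::size_t len, std::u16string *out)
{
  assert (out != nullptr);
  const unsigned char *p = reinterpret_cast<const unsigned char *> (s);

  // ASCII fast path: nothing to convert, nothing allocated.
  bool ascii = true;
  for (std::size_t i = 0; i < len; i++)
    if (p[i] >= 0x80)
      {
        ascii = false;
        break;
      }
  if (ascii)
    return false;

  out->clear ();
  out->reserve (len);           // UTF-16 never needs more units than bytes
  for (std::size_t i = 0; i < len; )
    {
      char32_t cp;
      std::size_t n = decode_one (p + i, len - i, &cp);
      if (n == 0)
        {
          // Assumption: substitute U+FFFD and skip one byte.
          cp = 0xFFFD;
          n = 1;
        }
      if (cp < 0x10000)
        out->push_back (static_cast<char16_t> (cp));
      else
        {
          // Encode as a surrogate pair.
          cp -= 0x10000;
          out->push_back (static_cast<char16_t> (0xD800 + (cp >> 10)));
          out->push_back (static_cast<char16_t> (0xDC00 + (cp & 0x3FF)));
        }
      i += n;
    }
  return true;
}
```

A caller following the ChangeLog's description would presumably use the
false return to take the existing byte-oriented output path and only hand
the converted buffer to a wide-character console API when the function
returns true.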