On Fri, 2025-09-12 at 16:45 +0100, Peter Damianov wrote:
> UTF-8 characters in diagnostic output (such as the warning emoji ⚠️
> used by fanalyzer) display as mojibake on Windows unless the utf8
> code page is being used
> 
> This patch adds UTF-8 to UTF-16 conversion when outputting to a
> console
> on Windows.
> 
> gcc/ChangeLog:
>       * pretty-print.cc (decode_utf8_char): Move forward
> declaration.
>       (utf8_to_utf16): New function to convert UTF-8 to UTF-16.
>       (is_console_handle): New function to detect Windows console
> handles.
>       (write_all): Add UTF-8 to UTF-16 conversion for console
> output,
>       falling back to WriteFile for ASCII strings and regular
> files.
> 
> Signed-off-by: Peter Damianov <peter0...@disroot.org>
> ---
> v2:
> Fix linux build by moving decode_utf8_char outside of ifdef
> Keep form feed
> 
>  gcc/pretty-print.cc | 132
> +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 129 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/pretty-print.cc b/gcc/pretty-print.cc
> index d79a8282cfb..c29e15a41f3 100644
> --- a/gcc/pretty-print.cc
> +++ b/gcc/pretty-print.cc
> @@ -38,11 +38,18 @@ along with GCC; see the file COPYING3.  If not
> see
>  #include <iconv.h>
>  #endif
>  
> +static int
> +decode_utf8_char (const unsigned char *, size_t len, unsigned int
> *);
> +
>  #ifdef __MINGW32__
>  
>  /* Replacement for fputs() that handles ANSI escape codes on Windows
> NT.
>     Contributed by: Liu Hao (lh_mouse at 126 dot com)
>  
> +   Extended by: Peter Damianov
> +   Converts UTF-8 to UTF-16 if outputting to a console, so that
> emojis and
> +   various other unicode characters don't get mojibak'd.
> +
>     XXX: This file is compiled into libcommon.a that will be self-
> contained.
>       It looks like that these functions can be put nowhere else. 
> */
>  
> @@ -50,11 +57,132 @@ along with GCC; see the file COPYING3.  If not
> see
>  #define WIN32_LEAN_AND_MEAN 1
>  #include <windows.h>
>  
> +/* Convert UTF-8 string to UTF-16.
> +   Returns true if conversion was performed, false if string is pure
> ASCII.
> +
> +   If the string contains only ASCII characters, returns false
> +   without allocating any memory.  Otherwise, a buffer that the
> caller
> +   must free is allocated and the string is converted into it.  */
> +static bool
> +utf8_to_utf16 (const char *utf8_str, size_t utf8_len, wchar_t
> **utf16_str,
> +            size_t *utf16_len)

Thanks for the patch.

I notice that libcpp/charset.cc defines a function convert_utf8_utf16
(albeit currently static).  Is there a way that this could be reused,
rather than adding a 2nd implementation?
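
For reference, the contract described in the quoted comment (return false
for pure ASCII without allocating, otherwise convert) can be sketched in
portable C++ roughly as follows.  This is only an illustration:
utf8_to_utf16_sketch is a hypothetical name, it is neither the patch's
utf8_to_utf16 nor libcpp's convert_utf8_utf16, and the use of
std::u16string and of U+FFFD for malformed input are assumptions of the
sketch, not something taken from either implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

/* Hypothetical helper for illustration only.  Returns false and leaves
   *OUT untouched if the input is pure ASCII; otherwise stores the UTF-16
   form in *OUT and returns true.  Malformed, overlong, surrogate and
   out-of-range sequences become U+FFFD (an assumption of this sketch).  */
static bool
utf8_to_utf16_sketch (const char *s, size_t len, std::u16string *out)
{
  assert (out != nullptr);
  const unsigned char *p = reinterpret_cast<const unsigned char *> (s);

  /* Fast path: pure ASCII needs no conversion and no allocation.  */
  bool ascii = true;
  for (size_t i = 0; i < len; i++)
    if (p[i] >= 0x80)
      {
        ascii = false;
        break;
      }
  if (ascii)
    return false;

  /* Smallest code point that legitimately needs N continuation bytes.  */
  static const unsigned int min_cp[] = { 0, 0x80, 0x800, 0x10000 };

  out->clear ();
  size_t i = 0;
  while (i < len)
    {
      unsigned int c = p[i++];
      unsigned int cp = 0xFFFD;
      size_t extra = 0;   /* Continuation bytes expected.  */
      bool bad = false;

      if (c < 0x80)
        cp = c;
      else if ((c & 0xE0) == 0xC0)
        { cp = c & 0x1F; extra = 1; }
      else if ((c & 0xF0) == 0xE0)
        { cp = c & 0x0F; extra = 2; }
      else if ((c & 0xF8) == 0xF0)
        { cp = c & 0x07; extra = 3; }
      else
        bad = true;

      for (size_t k = 0; k < extra && !bad; k++)
        {
          if (i < len && (p[i] & 0xC0) == 0x80)
            cp = (cp << 6) | (p[i++] & 0x3F);
          else
            bad = true;
        }

      if (bad || cp < min_cp[extra] || cp > 0x10FFFF
          || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

      if (cp < 0x10000)
        out->push_back (static_cast<char16_t> (cp));
      else
        {
          /* Encode as a surrogate pair.  */
          cp -= 0x10000;
          out->push_back (static_cast<char16_t> (0xD800 + (cp >> 10)));
          out->push_back (static_cast<char16_t> (0xDC00 + (cp & 0x3FF)));
        }
    }
  return true;
}
```

On Windows a caller could presumably pass the resulting buffer to a wide
console API such as WriteConsoleW, and keep the existing byte-oriented
path whenever the function returns false.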

[...snip...]

Sorry, I confess I don't know enough about Windows compat to comment
on the rest of the patch.  If it fixes things on Windows and
doesn't break other OSes, that's good, I suppose :/

Hope this is constructive
Dave
