Does this support require Make to be linked against the UCRT
run-time library, or does it also work with the older MSVCRT?

I haven't found anything explicitly mentioned about this in the official
doc:

https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

Also, the manifest can be applied to the executable even after compilation,
by running mt.exe on it (a standard MS workflow), so it shouldn't matter
which run-time library Make is linked against, because the manifest can be
added even after the link phase.  Not sure if that's a convincing argument,
though.
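
For illustration (the file names here are hypothetical, not the ones in the
patch), the mt.exe workflow amounts to a single command run after linking:

mt.exe -manifest utf8.manifest -outputresource:make.exe;#1

which embeds the manifest as resource #1 of the already-built executable,
independently of how that executable was compiled and linked.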

If Make is built with MSVC, does it have to be built with some new
enough version of Studio to have the necessary run-time support
for this feature, or any version will do?

I haven't built Make with MSVC at all (the patch is focused on building with
GNU tools), but again there is no mention of any version requirement in the
official doc above.

It is just another case of using a manifest file, where this time the
manifest is used to set the active code page of the process to UTF-8.

In fact, the manifest can be embedded into the target executable even
post-compilation, using mt.exe, so I don't think a recent version of VS
is a requirement to build properly.
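
For reference, the core of such a manifest (essentially the fragment shown in
the Microsoft doc linked above, minus the assemblyIdentity element) looks
like this:

<?xml version="1.0" encoding="utf-8"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

The activeCodePage element is the only part that matters for this feature.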

Does using UTF-8 as the active page in Make mean that locale-dependent
C library functions will behave as expected?

I think so.    Here is the relevant doc I found:

https://learn.microsoft.com/en-us/cpp/text/locales-and-code-pages?view=msvc-170

where the interesting bits are those where "operating system" is mentioned,
like:

"Also, the run-time library might obtain and use the value of the operating
system code page, which is constant for the duration of the program's
execution."

I believe that by setting the active code page of the process to UTF-8 we
are effectively forcing the process to treat the operating system code page
as UTF-8, as far as that process is concerned.
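
An easy way to confirm this (a hypothetical test program, not part of the
patch) is to check what GetACP() reports from inside a process that carries
the manifest:

#include <stdio.h>
#include <windows.h>

int
main (void)
{
  /* With the UTF-8 manifest embedded, this should print 65001 (CP_UTF8);
     without the manifest, it prints the legacy ANSI codepage, e.g. 1252.  */
  printf ("Active code page: %u\n", GetACP ());
  return 0;
}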

Did you try running Make with this manifest on older Windows systems,
like Windows 8.1 or 7?  It is important to make sure this manifest doesn't
preclude Make from running on those older systems, even though the
UTF-8 feature will then be unavailable.

I did not try as I don't have access to such systems, but it seems pretty
clear from the doc that this should not be a problem:

"You can declare this property and target/run on earlier Windows builds,
but you must handle legacy code page detection and conversion as usual.
With a minimum target version of Windows Version 1903, the process code
page will always be UTF-8 so legacy code page detection and conversion can
be avoided."

It sounds like Make will simply not use UTF-8 there, meaning that any UTF-8
input would still cause it to break, but that would happen anyway with such
input.  Based on the above, it shouldn't change existing behavior on those
older systems, and certainly not stop Make from running on them.

When Make invokes other programs (which it does quite a lot ;-),
and passes command-line arguments to it with non-ASCII characters,
what will happen to those non-ASCII characters?

I think your expectation is correct.  Windows seems to convert the
UTF-8-encoded strings to the current ANSI codepage, thus allowing non-ASCII
characters (those that are part of that ANSI codepage) to be propagated to
the non-UTF-8 program.

Below are some experiments to show this.

In what follows, 'mingw32-make' is today's (unpatched) Make for Windows, as
found in a typical mingw build distribution.  Since it is unpatched, it uses
the local ANSI codepage, which is windows-1252 on my machine.

'make' is the patched version which uses the UTF-8 codepage.

Makefile 'windows-1252-non-ascii.mk' is encoded in 1252 and has content:

hello :
<TAB>gcc ©\src.c -o ©\src.exe

where the (extended ASCII) Copyright sign has been used (0xA9 in 1252).

Makefile 'utf8.mk' has the same content but is encoded in UTF-8, so the
Copyright sign is represented as 0xC2 0xA9 (a two-byte UTF-8 sequence,
confirmed with a hex editor).

With the unpatched Make that uses the local codepage:

mingw32-make -f windows-1252-non-ascii.mk

works fine and produces the .exe under the copyright folder (current
behavior).

mingw32-make -f utf8.mk

breaks because the unpatched make can't understand the UTF-8 file
(expected).

With the patched Make that uses the UTF-8 codepage:

make -f windows-1252-non-ascii.mk

breaks because Make expects UTF-8 and we are feeding it a 1252 file.

make -f utf8.mk

works fine and produces the .exe under the copyright folder.

I believe this last case is the one that answers your question:

Make (now working in UTF-8) calls gcc (working in 1252) with some
UTF-8-encoded arguments.  gcc has no problem doing the compilation and
producing the executable under the Copyright folder, which suggests that
Windows did indeed convert the UTF-8 arguments into gcc's codepage (1252);
and because the Copyright sign does exist in 1252, the conversion succeeded,
allowing gcc to run.
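
A more direct way to observe the conversion (again a hypothetical test
program, not something I actually ran) would be to have the patched Make
invoke a small non-manifested child that dumps the raw bytes of its first
argument:

#include <stdio.h>

int
main (int argc, char **argv)
{
  /* Dump the raw bytes of the first argument, to show which encoding
     this (non-UTF-8, non-manifested) child actually received.  */
  if (argc > 1)
    {
      const unsigned char *p = (const unsigned char *) argv[1];
      while (*p)
        printf ("%02X ", *p++);
    }
  printf ("\n");
  return 0;
}

If Make passes the UTF-8 bytes 0xC2 0xA9 for the Copyright sign, a child
whose codepage is 1252 should see the single byte 0xA9, confirming that
Windows did the conversion on the way in.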

So it doesn't look like this change is disabling non-ASCII argument support
in programs called by Make.  They maintain whatever characters are available
in their local codepage (1252 in this case), and since UTF-8 covers the
entire Unicode spectrum, it doesn't seem like we are losing any currently
working scenarios.

So this feature will only be complete when the programs invoked by Make are
also UTF-8 capable.

I agree.  Make is the gateway to many programs and it can't control what
they do or how they work internally.  But by working in UTF-8 itself, it is
at least giving those other programs the chance to work in UTF-8, so it's
not blocking them.  In other words, if those programs could work in UTF-8
but Make itself couldn't, they wouldn't even get the chance to run, because
Make would break before even calling them.

Also, since the above experiments seem to suggest that we are not dropping
existing support for non-ASCII characters in programs called by Make, it
looks like a clear step forward in terms of Unicode support on Windows.

But you are right: if those programs themselves don't support UTF-8, they
are just going to error out when faced with full UTF-8 arguments (ones that
don't map to anything in their legacy encoding), but that will be an error
on their side, not Make's.

It is OK to alter the general Makefile.am files, but please note that
Make for Windows is canonically built using the build_w32.bat batch
file; building using the Unix configury stuff is an option not
currently directly supported by the project (although I believe it
does work).

I cross-compiled Make for Windows using gcc (mingw-w64) and the
autoconf + automake + configure + make approach, so that clearly worked for
me, but I didn't realize this wasn't the standard way to build for a Windows
host.

Does this mean that all builds of Make found in the various build
distributions of the GNU toolchain for Windows (like mingw32-make.exe in the
examples above) were necessarily built using build_w32.bat?  I suppose not
necessarily, because those distributors could be doing something similar to
what I did, using the Unix-like build approach.  If so, they could benefit
from the patch as-is.  If they build with build_w32.bat, they won't see a
difference until the patch gets applied there as well.

Since build_w32.bat is a Windows-specific batch file, does this rule out
cross-compilation as a canonical way to build Make for Windows?

Assuming all questions are answered first, would it be OK to work on the
build_w32.bat changes in a second separate patch, and keep the first one
focused only on the Unix-like build process?
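
For reference, I imagine the MinGW part of that second patch would boil down
to something like this (untested sketch, with placeholder file names): a
one-line resource script, say utf8.rc, containing

1 24 "utf8.manifest"

plus a windres step in build_w32.bat,

windres -o utf8_resource.o utf8.rc

and then adding utf8_resource.o to the final gcc link command.  Here 1 and
24 are CREATEPROCESS_MANIFEST_RESOURCE_ID and RT_MANIFEST respectively, so
the resource object does nothing but embed the manifest.  The MSVC part
could do the equivalent with rc.exe, or with mt.exe as mentioned above.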

Thanks,
Costas

On Sun, 19 Mar 2023 at 06:44, Eli Zaretskii <e...@gnu.org> wrote:

> > From: Costas Argyris <costas.argy...@gmail.com>
> > Date: Sat, 18 Mar 2023 16:37:20 +0000
> >
> > This is a proposed patch to enable UTF-8 support in GNU Make running on
> Windows host.
>
> Thanks.
>
> > Today, the make process on Windows is using the legacy system code page
> because of the "A" functions
> > called in the source code.    This means that any UTF-8 input to make on
> Windows will break.    A few
> > examples follow:
>
> Yes, this misfeature of using the system codepage is known, together
> with the consequences.
>
> > The attached patch incorporates the UTF-8 manifest into the build
> process of GNU Make when hosted on
> > Windows, and forces the built executable to use UTF-8 as its active code
> page, solving all problems shown
> > above because this has a global effect in the process.    All existing
> "A" calls use the UTF-8 code page now
> > instead of the legacy one.    This is the relevant Microsoft doc:
> >
> >
> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
> >
> > With the patch, after building make, the above cases now work on Windows:
> >
> > ######################
> > C:\Users\cargyris\temp>cat utf8Makefile.mk
> > hello :
> >         @echo ﹏
> >         @echo ❎
> > C:\Users\cargyris\temp>make -f utf8Makefile.mk
> > ﹏
> > ❎
> >
> > C:\Users\cargyris\temp>make -f ❎\utf8Makefile.mk
> > ﹏
> > ❎
> >
> > C:\Users\cargyris\temp>cd ❎
> >
> > C:\Users\cargyris\temp\❎>make -f utf8Makefile.mk
> > ﹏
> > ❎
> >
> > C:\Users\cargyris\temp\❎>make -f ❎\utf8Makefile.mk
> > ﹏
> > ❎
> > ######################
> >
> > This change might also fix other existing issues on Windows having to do
> with filenames and paths, but I
> > can't point at something particular right now.
> >
> > Would a patch like that be considered?
>
> Yes, of course.
>
> However, we need to understand better the conditions under which the
> UTF-8 support in Make will be activated, and the consequences of
> activating it.  Here are some specific questions, based on initial
> thinking about this:
>
>   . Does this support require Make to be linked against the UCRT
>     run-time library, or does it also work with the older MSVCRT?  If
>     Make is built with MSVC, does it have to be built with some new
>     enough version of Studio to have the necessary run-time support
>     for this feature, or any version will do?
>
>   . Does using UTF-8 as the active page in Make mean that
>     locale-dependent C library functions will behave as expected?  For
>     example, what happens with character classification functions such
>     as isalpha and isdigit, and what happens with functions related to
>     letter-case, such as tolower and stricmp -- will they perform
>     correctly with characters in the entire Unicode range?  (This
>     might be related to the first question above.)
>
>   . Did you try running Make with this manifest on older Windows
>     systems, like Windows 8.1 or 7?  It is important to make sure this
>     manifest doesn't preclude Make from running on those older
>     systems, even though the UTF-8 feature will then be unavailable.
>
>   . When Make invokes other programs (which it does quite a lot ;-),
>     and passes command-line arguments to it with non-ASCII characters,
>     what will happen to those non-ASCII characters?  I'm guessing that
>     if the program also has such a manifest, it will get the UTF-8
>     encoded strings verbatim, but what if it doesn't have such a
>     manifest?  (The vast majority of the programs Make invokes
>     nowadays don't have such manifests.)  Will Windows convert the
>     UTF-8 encoded strings into the system codepage, or will the
>     program get UTF-8 regardless of whether it can or cannot handle
>     them?  If the latter, it will become impossible to use non-ASCII
>     strings and file names with such programs even if those non-ASCII
>     characters can be represented using the current system ANSI
>     codepage, because most programs Make invokes on Windows don't
>     support UTF-8.  Your examples invoked only the built-in commands
>     of cmd.exe, but what happens if you instead invoke, say, GCC, and
>     pass it a non-ASCII file name, including a file name which cannot
>     be represented in the current ANSI codepage?
>
>   . Even if the answer to the previous question is, as I expect, that
>     Windows will convert UTF-8 encoded strings to the current ANSI
>     codepage, it is important to understand that with the UTF-8 active
>     codepage enabled Make will still be unable to invoke programs with
>     UTF-8 encoded strings if those programs don't have the same UTF-8
>     active codepage enabled, except if the non-ASCII characters in
>     those strings can be represented by the current ANSI codepage.  So
>     this feature will only be complete when the programs invoked by
>     Make are also UTF-8 capable.
>
> A specific comment on your patch:
>
> > --- a/Makefile.am
> > +++ b/Makefile.am
> > @@ -46,6 +46,8 @@ w32_SRCS =  src/w32/pathstuff.c src/w32/w32os.c
> src/w32/compat/dirent.c \
> >               src/w32/subproc/misc.c src/w32/subproc/proc.h \
> >               src/w32/subproc/sub_proc.c src/w32/subproc/w32err.c
>
> It is OK to alter the general Makefile.am files, but please note that
> Make for Windows is canonically built using the build_w32.bat batch
> file; building using the Unix configury stuff is an option not
> currently directly supported by the project (although I believe it
> does work).  So to be effective, these changes need to make the
> commands in that batch file to run windres as well, and the file
> README.W32 should mention the manifest installation alongside the
> executable.  Note that build_w32.bat supports both MinGW GCC and MSVC
> builds, in two separate parts, and so the changes should affect both
> parts.
>
> Thanks again for your interest in GNU Make.
>
