Hi Yutani,

On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
Hi,

I'm more than excited about the announcement about the upcoming UTF-8
R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
work on Windows with non-UTF-8 encoding as the system locale? I think
this blog post indicates so (as this describes the older Windows than
the UTF-8 era), but I'm not fully confident if I understand the
details correctly.

R 4.2 will automatically use UTF-8 as the active code page (system locale), the C library encoding, and the R current native encoding on systems which allow this (recent Windows 10 and newer, Windows Server 2022, etc.). There is no way to opt out of that, and of course no reason to, either. It does not matter what the system-wide locale is set to in Windows: these recent versions of Windows allow individual applications to override the system-wide setting and use UTF-8, which is what R does. Typically the system-wide setting will not be UTF-8, because many applications would not work with that.

On older systems, R 4.2 will run in some other system locale, with the C library encoding and the R current native encoding matching the system default, just as R 4.1 would on that system. So for some time, encoding support for this will have to stay in R, but eventually it will be removed. But yes, R 4.2 is still supposed to work on such systems.

https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html

If so, I'm curious what the package authors should do when the locales
are different between OS and R. For example (disclaimer: I don't
intend to blame processx at all. Just for an example), the CRAN check
on the processx package currently fails with this warning on R-devel
Windows.

     1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character at end of stream ignored
https://cran.r-project.org/web/checks/check_results_processx.html

As far as I know, processx launches an external process and captures
its output, and I suspect the problem is that the output of the
process is encoded in non-UTF-8 while R assumes it's UTF-8. I
experienced similar problems with other packages as well, which
disappear if I switch the locale to the same one as the OS by
Sys.setlocale(). So, I think it would be great if there's some
guidance for the package authors on how to handle these properly.

Incidentally I've debugged this case and sent a detailed analysis to the maintainer, so he knows about the problem.

In short, on Windows you cannot assume that different applications use the same system encoding. That has not been true at least since the introduction of fusion manifests, which allow an application to switch to UTF-8 as its system encoding (which R does). So, when using an external application on Windows, you need to know and respect the specific encoding used by that application for input and output.

As an example based on processx, suppose you have an application which prints its arguments to standard output. If you do it this way:

$ cat pr.c
#include <stdio.h>
#include <locale.h>
#include <string.h>

int main(int argc, char **argv) {

        printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
        int i;
        for(i = 0; i < argc; i++) {
                printf("Argument %d\n", i);
                printf("%s\n", argv[i]);
                for(size_t j = 0; j < strlen(argv[i]); j++) {
                        /* print each byte of the argument in hex and decimal */
                        printf("byte[%zu] is %x (%d)\n", j,
                               (unsigned char)argv[i][j],
                               (unsigned char)argv[i][j]);
                }
        }
        return 0;
}

the argument, and hence the output, will be in the current native encoding of pr.c, because that is the encoding in which the argument is received from Windows: by default the system locale encoding, so by default not UTF-8 (Latin-1 on my system, as well as on the CRAN check systems). On such systems one should also only use such programs with characters representable in Latin-1. When you call such an application from R with UTF-8 as the native encoding, Windows will automatically convert the arguments to Latin-1.

The old Windows way to avoid this problem is to use the wide-character API (now UTF-16LE):

$ cat prw.c
#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* with MinGW-w64, build with: gcc -municode -o prw prw.c
   (-municode selects the wmain entry point) */
int wmain(int argc, wchar_t **argv) {

        int i;
        for(i = 0; i < argc; i++) {
                wprintf(L"Argument %d\n", i);
                /* never pass the argument itself as the format string */
                wprintf(L"%ls\n", argv[i]);
                for(size_t j = 0; j < wcslen(argv[i]); j++)
                        wprintf(L"Word[%zu] %x\n", j, (unsigned)argv[i][j]);
        }
        return 0;
}

When you call such a program from R with UTF-8 as the native encoding, Windows will convert the arguments to UTF-16LE (so all characters will be representable). But you need to write Windows-specific code for this.

The new Windows way to avoid this problem is to use UTF-8 as the native encoding via the fusion manifest, as R does. You can use the "pr.c" as above, but with something like

$ cat pr.rc
#include <windows.h>
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"

$ cat pr.manifest
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
<assemblyIdentity
    version="1.0.0.0"
    processorArchitecture="amd64"
    name="pr.exe"
    type="win32"
/>
<application>
  <windowsSettings>
    <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
  </windowsSettings>
</application>
</assembly>

$ windres.exe -i pr.rc -o pr_rc.o
$ gcc -o pr pr.c pr_rc.o

When you build the application this way, it will use UTF-8 as its native encoding, so when you call it from R (with UTF-8 as the native encoding), no input conversion will occur. However, when you do this, the output from the application will also be in UTF-8.

So, for applications you control, my recommendation would be to make them use Unicode in one of these two ways, preferably the new one, with the fusion manifest. Only if an application were Windows-only and had to work on older Windows would the wide-character version make sense (but such applications are probably not in R packages).

When working with external applications you don't control, it is harder: you need to know which encoding they expect and produce in whatever interface you use, and convert accordingly, e.g. using iconv(). By the interface I mean, e.g., that command-line arguments are converted by Windows, but input/output sent over a file or stream will not be.

Of course, this works the other way around as well: if you use R with other external applications expecting a different encoding, you need to handle that (by conversions). With applications you control, it would make sense to use this opportunity to switch to UTF-8. In principle, you can use iconv() from R, directly or indirectly, to convert input/output streams to/from a known encoding.

I am happy to give more suggestions if there is interest, but for that it would be useful to have a specific example (with processx, it is clear what the options are, as the application there is controlled by the package).

Best
Tomas

Any suggestions?

Best,
Yutani

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
