Hi Yutani,

On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
Hi,

I'm more than excited about the announcement about the upcoming UTF-8
R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
work on Windows with non-UTF-8 encoding as the system locale? I think
this blog post indicates so (as this describes the older Windows than
the UTF-8 era), but I'm not fully confident if I understand the
details correctly.

R 4.2 will automatically use UTF-8 as the active code page (system locale), the C library encoding, and the R current native encoding on systems which allow this (recent Windows 10 and newer, Windows Server 2022, etc.). There is no way to opt out of that, and of course no reason to, either. It does not matter what the system-wide locale is set to in Windows: these recent versions of Windows allow individual applications to override the system-wide setting and use UTF-8, which is what R does. Typically the system-wide setting will not be UTF-8, because many applications would not work with that.

On older systems, R 4.2 will run in some other system locale, with the C library encoding and the R current native encoding matching the system default, just as R 4.1 would on that system. So for some time, encoding support for this will have to stay in R, but eventually it will be removed. But yes, R 4.2 is still supposed to work on such systems.

https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html

If so, I'm curious what the package authors should do when the locales
are different between OS and R. For example (disclaimer: I don't
intend to blame processx at all. Just for an example), the CRAN check
on the processx package currently fails with this warning on R-devel
Windows.

     1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character at end of stream ignored
https://cran.r-project.org/web/checks/check_results_processx.html

As far as I know, processx launches an external process and captures
its output, and I suspect the problem is that the output of the
process is encoded in non-UTF-8 while R assumes it's UTF-8. I
experienced similar problems with other packages as well, which
disappear if I switch the locale to the same one as the OS by
Sys.setlocale(). So, I think it would be great if there's some
guidance for the package authors on how to handle these properly.

Incidentally I've debugged this case and sent a detailed analysis to the maintainer, so he knows about the problem.

In short, on Windows you cannot assume that different applications use the same system encoding. That has not been true at least since the introduction of fusion manifests, which allow an application to switch to UTF-8 as its system encoding (which R does). So, when using an external application on Windows, you need to know and respect the specific encoding used by that application for input and output.

As an example based on processx, suppose you have an application which prints its arguments to standard output. If you do it this way:

$ cat pr.c
#include <stdio.h>
#include <locale.h>
#include <string.h>

int main(int argc, char **argv) {

        printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
        int i;
        for(i = 0; i < argc; i++) {
                printf("Argument %d\n", i);
                printf("%s\n", argv[i]);
                for(size_t j = 0; j < strlen(argv[i]); j++) {
                        /* print each byte of the argument in hex and decimal */
                        printf("byte[%zu] is %x (%d)\n", j,
                               (unsigned char)argv[i][j],
                               (unsigned char)argv[i][j]);
                }
        }
        return 0;
}

the argument, and hence the output, will be in the current native encoding of pr.c, because that is the encoding in which the argument is received from Windows: by default the system locale encoding, so by default not UTF-8 (Latin-1 on my system, as well as on the CRAN check systems). On such systems one should also only use such programs with characters representable in Latin-1. When you call such an application from R with UTF-8 as the native encoding, Windows will automatically convert the arguments to Latin-1.

The old Windows way to avoid this problem is to use the wide-character API (now UTF-16LE):

$ cat prw.c
#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* with MinGW-w64, build with: gcc -municode -o prw prw.c
   (-municode selects the wmain entry point) */
int wmain(int argc, wchar_t **argv) {

        int i;
        for(i = 0; i < argc; i++) {
                wprintf(L"Argument %d\n", i);
                /* never pass the argument itself as the format string */
                wprintf(L"%ls\n", argv[i]);
                for(size_t j = 0; j < wcslen(argv[i]); j++)
                        wprintf(L"Word[%zu] %x\n", j, (unsigned)argv[i][j]);
        }
        return 0;
}

When you call such a program from R with UTF-8 as the native encoding, Windows will convert the arguments to UTF-16LE (so all characters will be representable). But you need to write Windows-specific code for this.

The new Windows way to avoid this problem is to use UTF-8 as the native encoding via the fusion manifest, as R does. You can use the "pr.c" as above, but with something like

$ cat pr.rc
#include <windows.h>
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"

$ cat pr.manifest
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
<assemblyIdentity
    version="1.0.0.0"
    processorArchitecture="amd64"
    name="pr.exe"
    type="win32"
/>
<application>
  <windowsSettings>
    <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
  </windowsSettings>
</application>
</assembly>

$ windres.exe -i pr.rc -o pr_rc.o
$ gcc -o pr pr.c pr_rc.o

When you build the application this way, it will use UTF-8 as its native encoding, so when you call it from R (with UTF-8 as the native encoding), no input conversion will occur. However, when you do this, the output from the application will also be in UTF-8.

So, for applications you control, my recommendation would be to make them use Unicode in one of these two ways, preferably the new one, with the fusion manifest. Only if an application were Windows-only and had to work on older Windows would the wide-character version make sense (but such applications are probably not in R packages).

When working with external applications you don't control, it is harder: you need to know which encoding they expect and produce in whatever interface you use, and convert accordingly, e.g. using iconv(). By the interface I mean, e.g., that command-line arguments are converted by Windows, but input/output sent over a file or stream will not be.

Of course, this works the other way around as well: if you use R with other external applications expecting a different encoding, you need to handle that (by conversions). With applications you control, it would make sense to use this opportunity to switch to UTF-8. In principle, you can use iconv() from R, directly or indirectly, to convert input/output streams to/from a known encoding.

I am happy to give more suggestions if there is interest, but for that it would be useful to have a specific example (with processx, it is clear what the options are, as the application there is controlled by the package).

Best
Tomas

Any suggestions?

Best,
Yutani

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
