Hi Tomas,

Thanks for your prompt reply and for spotting the right place. While I'm not good at C/C++, I'll try investigating this and, if possible, create a patch to fix the issue. As UTF-8 R on Windows is really exciting news to us in CJK locales, I'd like to do my best to help make the upcoming release a success.

I'll report on Bugzilla with more details first. Thanks for your support.

Best,
Yutani

On Wed, Dec 22, 2021 at 0:23, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
> Hi Yutani,
>
> On 12/21/21 3:47 PM, Hiroaki Yutani wrote:
> > Hi Tomas,
> >
> > Thank you very much for the detailed explanation! I think now I have a
> > somewhat better understanding of how things work; at least now I know I
> > didn't understand the concept of "active code page". I'll follow your
> > advice when I need to fix the packages that need some tweaks to handle
> > UTF-8 properly.
> >
> > Sorry, I'd like to ask one more question related to locale. If I copy
> > the following text and execute `read.csv("clipboard")`, it returns
> > "uao" instead of "úáö" (the characters are transliterated).
> >
> > "col1","col2"
> > "úáö","úáö"
> >
> > While this is probably the status quo (the same behavior on R 4.1) on
> > Latin-1 encoding, things are worse on CJK locales. If I try,
> >
> > "col1","col2"
> > "あ","い"
> >
> > I get the following error:
> >
> > > read.csv("clipboard")
> > Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec, :
> >   invalid multibyte string at '<82><a0>'
> >
> > Is this supposed to work? It seems the characters are encoded as CP932
> > (my system locale) but marked as UTF-8.
> >
> > > x <- utils:::readClipboard()
> > > x
> > [1] "\"col1\",\"col2\"" "\"\x82\xa0\",\"\x82\xa2\""
> > > iconv(x, from = "CP932", to = "UTF-8")
> > [1] "\"col1\",\"col2\"" "\"あ\",\"い\""
> >
> > I read the source code of readClipboard() in
> > src/library/utils/src/windows/util.c, but have no idea if there's
> > anything that needs to be fixed.
>
> Yes, this should work. I can reproduce the problem on my system: the
> clipboard apparently contains the Unicode characters, but R does not get
> them correctly, and from my quick read, it is a bug in R.
>
> My guess is this is in connections.c, where we call
> GetClipboardData(CF_TEXT).
> Perhaps if we used CF_UNICODETEXT, it would
> work (or alternatively CF_TEXT but also CF_LOCALE to find out which
> locale is used, but CF_UNICODETEXT seems simpler). See
> https://docs.microsoft.com/en-us/windows/win32/dataxchg/standard-clipboard-formats
>
> As you started looking at the code, would you like to try
> debugging/fixing this?
>
> Best
> Tomas
>
> >
> > Best,
> > Yutani
> >
> > On Tue, Dec 21, 2021 at 17:26, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
> >
> >> Hi Yutani,
> >>
> >> On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
> >>> Hi,
> >>>
> >>> I'm more than excited about the announcement about the upcoming UTF-8
> >>> R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
> >>> work on Windows with non-UTF-8 encoding as the system locale? I think
> >>> this blog post indicates so (as it describes Windows versions older
> >>> than the UTF-8 era), but I'm not fully confident that I understand the
> >>> details correctly.
> >> R 4.2 will automatically use UTF-8 as the active code page (system
> >> locale) and the C library encoding and the R current native encoding on
> >> systems which allow this (recent Windows 10 and newer, Windows Server
> >> 2022, etc). There is no way to opt out of that, and of course no
> >> reason to, either. It does not matter which system locale is set
> >> in Windows for the whole system - these recent Windows versions allow
> >> individual applications to override the system-wide setting to UTF-8,
> >> which is what R does. Typically the system-wide setting will not be
> >> UTF-8, because many applications will not work with that.
> >>
> >> On older systems, R 4.2 will run in some other system locale and the
> >> same C library encoding and R current native encoding - the same system
> >> default as R 4.1 would run on that system. So for some time, encoding
> >> support for this in R will have to stay, but eventually will be removed.
> >> But yes, R 4.2 is still supposed to work on such systems.
> >>
> >>> https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html
> >>>
> >>> If so, I'm curious what the package authors should do when the locales
> >>> are different between OS and R. For example (disclaimer: I don't
> >>> intend to blame processx at all. Just for an example), the CRAN check
> >>> on the processx package currently fails with this warning on R-devel
> >>> Windows.
> >>>
> >>>> 1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte
> >>>> character at end of stream ignored
> >>> https://cran.r-project.org/web/checks/check_results_processx.html
> >>>
> >>> As far as I know, processx launches an external process and captures
> >>> its output, and I suspect the problem is that the output of the
> >>> process is encoded in non-UTF-8 while R assumes it's UTF-8. I
> >>> experienced similar problems with other packages as well, which
> >>> disappear if I switch the locale to the same one as the OS by
> >>> Sys.setlocale(). So, I think it would be great if there's some
> >>> guidance for the package authors on how to handle these properly.
> >> Incidentally I've debugged this case and sent a detailed analysis to the
> >> maintainer, so he knows about the problem.
> >>
> >> In short, you cannot assume in Windows that different applications use
> >> the same system encoding. That has not been true at least since the
> >> introduction of fusion manifests, which allow an application to switch
> >> to UTF-8 as system encoding, which R does. So, when using an external
> >> application on Windows, you need to know and respect the specific
> >> encoding used by that application on input and output.
> >>
> >> As an example based on processx, you have an application which prints
> >> its argument to standard output.
> >> If you do it this way:
> >>
> >> $ cat pr.c
> >> #include <stdio.h>
> >> #include <locale.h>
> >> #include <string.h>
> >>
> >> int main(int argc, char **argv) {
> >>
> >>     printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
> >>     int i;
> >>     for(i = 0; i < argc; i++) {
> >>         printf("Argument %d\n", i);
> >>         printf("%s\n", argv[i]);
> >>         /* dump each byte of the argument in hex and decimal */
> >>         for(int j = 0; j < strlen(argv[i]); j++) {
> >>             printf("byte[%d] is %x (%d)\n", j,
> >>                    (unsigned char)argv[i][j], (unsigned char)argv[i][j]);
> >>         }
> >>     }
> >>     return 0;
> >> }
> >>
> >> the argument and hence output will be in the current native encoding of
> >> pr.c, because that's the encoding in which the argument will be received
> >> from Windows, so by default the system locale encoding, so by default
> >> not UTF-8 (on my system in Latin-1, as well as on CRAN check systems).
> >> One should also only use such programs with characters representable in
> >> Latin-1 on such systems. When you call such an application from R with
> >> UTF-8 as native encoding, Windows will automatically convert the
> >> arguments to Latin-1.
> >>
> >> The old Windows way to avoid this problem is to use the wide-character
> >> API (now UTF-16LE):
> >>
> >> $ cat prw.c
> >> #include <stdio.h>
> >> #include <locale.h>
> >> #include <string.h>
> >> #include <wchar.h>
> >>
> >> int wmain(int argc, wchar_t **argv) {
> >>
> >>     int i;
> >>     for(i = 0; i < argc; i++) {
> >>         wprintf(L"Argument %d\n", i);
> >>         /* print via %ls so the argument is not used as a format string */
> >>         wprintf(L"%ls\n", argv[i]);
> >>         for(int j = 0; j < wcslen(argv[i]); j++)
> >>             wprintf(L"Word[%d] %x\n", j,
> >>                     (unsigned)argv[i][j]);
> >>     }
> >>     return 0;
> >> }
> >>
> >> When you call such a program from R with UTF-8 as native encoding, Windows
> >> will convert the arguments to UTF-16LE (so all characters will be
> >> representable). But you need to write Windows-specific code for this.
> >>
> >> The new Windows way to avoid this problem is to use UTF-8 as the native
> >> encoding via the fusion manifest, as R does.
> >> You can use the "pr.c" as
> >> above, but with something like
> >>
> >> $ cat pr.rc
> >> #include <windows.h>
> >> CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"
> >>
> >> $ cat pr.manifest
> >> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> >> <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
> >>   <assemblyIdentity
> >>     version="1.0.0.0"
> >>     processorArchitecture="amd64"
> >>     name="pr.exe"
> >>     type="win32"
> >>   />
> >>   <application>
> >>     <windowsSettings>
> >>       <activeCodePage
> >>         xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
> >>     </windowsSettings>
> >>   </application>
> >> </assembly>
> >>
> >> windres.exe -i pr.rc -o pr_rc.o
> >> gcc -o pr pr.c pr_rc.o
> >>
> >> When you build the application this way, it will use UTF-8 as native
> >> encoding, so when you call it from R (with UTF-8 as native encoding), no
> >> input conversion will occur. However, when you do this, the output from
> >> the application will also be in UTF-8.
> >>
> >> So, for applications you control, my recommendation would be to make
> >> them use Unicode in one of these two ways. Preferably the new one, with
> >> the fusion manifest. Only if it were a Windows-only application, and had
> >> to work on older Windows, then the wide-character version (but such apps
> >> are probably not in R packages).
> >>
> >> When working with external applications you don't control, it is harder
> >> - you need to know which encoding they are expecting and producing, in
> >> whatever interface you use, and convert that, e.g. using iconv(). By the
> >> interface I mean that e.g., the command-line arguments are converted by
> >> Windows, but the input/output sent over a file/stream will not be.
> >>
> >> Of course, this works the other way around as well. If you were using R
> >> with some other external applications expecting a different encoding,
> >> you would need to handle that (by conversions).
> >> With applications you
> >> control, it would make sense to use this opportunity to switch to UTF-8.
> >> But, in principle, you can use iconv() from R directly or indirectly to
> >> convert input/output streams to/from a known encoding.
> >>
> >> I am happy to give more suggestions if there is interest, but for that
> >> it would be useful to have a specific example (with processx, it is
> >> clear what the options are, there the application is controlled by the
> >> package).
> >>
> >> Best
> >> Tomas
> >>> Any suggestions?
> >>>
> >>> Best,
> >>> Yutani
> >>>
> >>> ______________________________________________
> >>> R-devel@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
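
The "invalid multibyte string at '<82><a0>'" error reported in this thread can be illustrated outside R. What follows is only a minimal sketch in C, not R's actual implementation: is_valid_utf8 is a hypothetical helper written for this note. It shows why the CP932 bytes 82 a0 (あ) are rejected when treated as UTF-8: 0x82 is a continuation byte and cannot start a sequence, whereas e3 81 82 (the UTF-8 encoding of あ) is well-formed.

```c
#include <stddef.h>

/* Hypothetical helper (not taken from R's sources): returns 1 if the
   n bytes at s form well-formed UTF-8, and 0 otherwise. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t len;          /* total number of bytes in this sequence */
        unsigned long cp;    /* code point decoded so far */
        unsigned long min;   /* smallest code point allowed for this len */

        if (c < 0x80)                { len = 1; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte or invalid lead byte */

        if (len > n - i) return 0;              /* sequence is truncated */

        for (size_t k = 1; k < len; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return 0;   /* not a continuation byte */
            cp = (cp << 6) | (t & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */

        i += len;
    }
    return 1;
}
```

Called on the two bytes "\x82\xa0" the helper returns 0, and on the three bytes "\xe3\x81\x82" it returns 1. That matches the symptom described above: the clipboard text arrived as CP932 but was handled as if it were UTF-8.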