This particular issue has a simple fix. Currently, the "R_check_locale" function includes the following code starting at line 244 in src/main/platform.c:

#ifdef Win32
    {
        char *ctype = setlocale(LC_CTYPE, NULL), *p;
        p = strrchr(ctype, '.');
        if (p && isdigit(p[1])) localeCP = atoi(p+1); else localeCP = 0;
        /* Not 100% correct, but CP1252 is a superset */
        known_to_be_latin1 = latin1locale = (localeCP == 1252);
    }
#endif

The "1252" should be "28591"; see
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx .

> Daniel Possenriede <mailto:possenri...@gmail.com>
> September 14, 2017 at 3:40 AM
> This is a follow-up on my initial posts regarding character encodings
> on Windows
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and
> Patrick Perry's reply
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in
> particular (thank you for the links and the bug report!). My initial
> posts were quite chaotic (and partly wrong), so I am trying to clear
> things up a bit.
>
> Actually, the title of my original message "special latin1
> [characters] do not print as glyphs in current devel on windows" is
> already wrong, because the problem exists with characters with CP1252
> encoding in the 80-9F (hex) range. Like Brian Ripley rightfully
> pointed out, latin1 != CP1252. The characters in the 80-9F code point
> range are not even part of ISO/IEC 8859-1 a.k.a. latin1, see for
> example https://en.wikipedia.org/wiki/Windows-1252. R treats them as
> if they were, however, and that is exactly the problem, IMHO.
>
> Let me show you what I mean. (All output from R 3.5 r73238, see
> sessionInfo at the end)
>
> > Sys.getlocale("LC_CTYPE")
> [1] "German_Germany.1252"
> > x <- c("€", "ž", "š", "ü")
> > sapply(x, charToRaw)
> \u0080 \u009e \u009a      ü
>     80     9e     9a     fc
>
> "€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also
> show the "ü" just as an example of a non-ASCII character outside that
> range (and because Patrick Perry used it in his bug report which might
> be a (slightly) different problem, but I will get to that later.)
>
> > print(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> "€", "ž", and "š" are printed as (incorrect) unicode escapes. "€" for
> example should be \u20ac not \u0080.
> (In R 3.4.1, print(x) shows the glyphs and not the unicode escapes.
> Apparently, as of v3.5, print() calls enc2utf8() (or its equivalent in
> C (translateCharUTF8?))?)
>
> > print("\u20ac")
> [1] "€"
>
> The characters in x are marked as "latin1".
>
> > Encoding(x)
> [1] "latin1" "latin1" "latin1" "latin1"
>
> Looking at the CP1252 table (e.g. link above), we see that this is
> incorrect for "€", "ž", and "š", which simply do not exist in latin1.
>
> As per the documentation, "enc2utf8 convert[s] elements of character
> vectors to [...] UTF-8 [...], taking any marked encoding into
> account." Since the marked encoding is wrong, so is the output of
> enc2utf8().
>
> > enc2utf8(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> Now, when we set the encoding to "unknown" everything works fine.
>
> > x_un <- x
> > Encoding(x_un) <- "unknown"
> > print(x_un)
> [1] "€" "ž" "š" "ü"
> > (x_un2utf8 <- enc2utf8(x_un))
> [1] "€" "ž" "š" "ü"
>
> Long story short: The characters in the 80 to 9F range should not be
> marked as "latin1" on CP1252 locales, IMHO.
>
> As a side-note: the output of localeToCharset() is also problematic,
> since ISO8859-1 != CP1252.
>
> > localeToCharset()
> [1] "ISO8859-1"
>
> Finally on to Patrick Perry's bug report
> (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On
> Windows, enc2utf8("ü") yields "|".'
>
> Unfortunately, I cannot reproduce this with the CP1252 locale, as can
> be seen above. Probably, because the bug applies to the C locale
> (sorry if this is somewhere apparent in the bug report and I missed it).
>
> > Sys.setlocale("LC_CTYPE", "C")
> [1] "C"
> > enc2utf8("ü")
> [1] "|"
> > charToRaw("ü")
> [1] fc
> > Encoding("ü")
> [1] "unknown"
>
> This does not seem to be related to the marked encoding of the string,
> so it seems to me that this is a different problem than the one above.
>
> Any advice on how to proceed further would be highly appreciated.
>
> Thanks!
> Daniel
>
> > sessionInfo()
> R Under development (unstable) (2017-09-11 r73238)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 14393)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
> [5] LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.5.0
>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
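
P.S. For anyone who wants to try the change suggested at the top of this message outside of R, here is a minimal standalone sketch of the same logic. The helper names `locale_codepage` and `is_latin1_codepage` and the macro `CP_ISO_8859_1` are hypothetical and do not exist in src/main/platform.c; the parsing simply mirrors the snippet quoted from R_check_locale, and 28591 is the Windows code page identifier for ISO 8859-1. This is only an illustration under those assumptions, not the actual patch.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Windows code page identifier for ISO 8859-1 (Latin-1). */
#define CP_ISO_8859_1 28591

/* Hypothetical helper: extract the numeric code page from a locale name
   such as "German_Germany.1252". Returns 0 when the name carries no
   numeric suffix (for example the "C" locale). */
static int locale_codepage(const char *ctype)
{
    const char *p = ctype ? strrchr(ctype, '.') : NULL;
    if (p && isdigit((unsigned char) p[1]))
        return atoi(p + 1);
    return 0;
}

/* Hypothetical helper: under the suggested change, only a genuine
   ISO 8859-1 code page is treated as latin1; CP1252 is not. */
static int is_latin1_codepage(int cp)
{
    return cp == CP_ISO_8859_1;
}
```

With these helpers, `is_latin1_codepage(locale_codepage("German_Germany.1252"))` is false, so strings in a CP1252 locale would no longer be assumed to be latin1. The cast to `unsigned char` before `isdigit` is a defensive addition of mine and is not in the quoted source.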