Following up on this, since the associated bug report (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329) was just closed: my original report was incomplete and did not include sessionInfo() or LC_CTYPE.
Admittedly, my original bug report was a little confused. I have since gained a better understanding of the issue. I want to (a) confirm that this is a real bug in base R, not RStudio, and (b) provide more context.

It looks like the real issue is that R marks native strings as "latin1" when the declared character locale is Windows-1252. This causes problems when converting to UTF-8. See Daniel Possenriede's email below for much more detail, including his sessionInfo() and a reproducible example.

The development version of the `stringi` package and the CRAN version of the `utf8` package both have workarounds for this bug. (See, e.g., https://github.com/gagolews/stringi/issues/287 and the links to the related issues.)

Patrick

> Patrick Perry <mailto:ppe...@stern.nyu.edu>
> September 14, 2017 at 7:47 AM
> This particular issue has a simple fix. Currently, the
> "R_check_locale" function includes the following code starting at line
> 244 in src/main/platform.c:
>
> #ifdef Win32
>     {
>         char *ctype = setlocale(LC_CTYPE, NULL), *p;
>         p = strrchr(ctype, '.');
>         if (p && isdigit(p[1])) localeCP = atoi(p+1); else localeCP = 0;
>         /* Not 100% correct, but CP1252 is a superset */
>         known_to_be_latin1 = latin1locale = (localeCP == 1252);
>     }
> #endif
>
> The "1252" should be "28591"; see
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx .
>
>
> Daniel Possenriede <mailto:possenri...@gmail.com>
> September 14, 2017 at 3:40 AM
> This is a follow-up on my initial posts regarding character encodings
> on Windows
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and
> Patrick Perry's reply
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in
> particular (thank you for the links and the bug report!). My initial
> posts were quite chaotic (and partly wrong), so I am trying to clear
> things up a bit.
>
> Actually, the title of my original message "special latin1
> [characters] do not print as glyphs in current devel on windows" is
> already wrong, because the problem exists with characters with CP1252
> encoding in the 80-9F (hex) range. Like Brian Ripley rightfully
> pointed out, latin1 != CP1252. The characters in the 80-9F code point
> range are not even part of ISO/IEC 8859-1 a.k.a. latin1, see for
> example https://en.wikipedia.org/wiki/Windows-1252. R treats them as
> if they were, however, and that is exactly the problem, IMHO.
>
> Let me show you what I mean. (All output from R 3.5 r73238, see
> sessionInfo at the end)
>
> > Sys.getlocale("LC_CTYPE")
> [1] "German_Germany.1252"
> > x <- c("€", "ž", "š", "ü")
> > sapply(x, charToRaw)
> \u0080 \u009e \u009a      ü
>     80     9e     9a     fc
>
> "€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also
> show the "ü" just as an example of a non-ASCII character outside that
> range (and because Patrick Perry used it in his bug report which might
> be a (slightly) different problem, but I will get to that later.)
>
> > print(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> "€", "ž", and "š" are printed as (incorrect) unicode escapes. "€" for
> example should be \u20ac not \u0080.
> (In R 3.4.1, print(x) shows the glyphs and not the unicode escapes.
> Apparently, as of v3.5, print() calls enc2utf8() (or its equivalent in
> C (translateCharUTF8?))?)
>
> > print("\u20ac")
> [1] "€"
>
> The characters in x are marked as "latin1".
>
> > Encoding(x)
> [1] "latin1" "latin1" "latin1" "latin1"
>
> Looking at the CP1252 table (e.g. link above), we see that this is
> incorrect for "€", "ž", and "š", which simply do not exist in latin1.
>
> As per the documentation, "enc2utf8 convert[s] elements of character
> vectors to [...] UTF-8 [...], taking any marked encoding into
> account." Since the marked encoding is wrong, so is the output of
> enc2utf8().
>
> > enc2utf8(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> Now, when we set the encoding to "unknown" everything works fine.
>
> > x_un <- x
> > Encoding(x_un) <- "unknown"
> > print(x_un)
> [1] "€" "ž" "š" "ü"
> > (x_un2utf8 <- enc2utf8(x_un))
> [1] "€" "ž" "š" "ü"
>
> Long story short: The characters in the 80 to 9F range should not be
> marked as "latin1" on CP1252 locales, IMHO.
>
> As a side-note: the output of localeToCharset() is also problematic,
> since ISO8859-1 != CP1252.
>
> > localeToCharset()
> [1] "ISO8859-1"
>
> Finally on to Patrick Perry's bug report
> (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On
> Windows, enc2utf8("ü") yields "|".'
>
> Unfortunately, I cannot reproduce this with the CP1252 locale, as can
> be seen above. Probably, because the bug applies to the C locale
> (sorry if this is somewhere apparent in the bug report and I missed it).
>
> > Sys.setlocale("LC_CTYPE", "C")
> [1] "C"
> > enc2utf8("ü")
> [1] "|"
> > charToRaw("ü")
> [1] fc
> > Encoding("ü")
> [1] "unknown"
>
> This does not seem to be related to the marked encoding of the string,
> so it seems to me that this is a different problem than the one above.
>
> Any advice on how to proceed further would be highly appreciated.
>
> Thanks!
> Daniel
>
> > sessionInfo()
> R Under development (unstable) (2017-09-11 r73238)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 14393)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=C
> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
> [5] LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.5.0
>
> Patrick Perry <mailto:ppe...@stern.nyu.edu>
> August 27, 2017 at 11:40 AM
> Regarding the Windows character encoding issues Daniel Possenriede
> posted about earlier this month, where non-Latin-1 strings were
> getting marked as such
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074731.html ):
>
> The issue is that on Windows, when the character locale is
> Windows-1252, R marks some (possibly all) native non-ASCII strings as
> "latin1". I posted a related bug report:
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329 . The bug
> report also includes a link to a fix for a related issue: converting
> strings from Windows native to UTF-8.
>
> There is a work-around for this bug in the current development version
> of the 'corpus' package (not on CRAN yet). See
> https://github.com/patperry/r-corpus/issues/5 . I have tested this on
> a Windows-1252 install of R, but I have not tested it on a Windows
> install in another locale. It'd be great if someone with such an
> install would test the fix and report back, either here or on the
> github issue.
>
>
> Patrick
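
For anyone who wants to see the latin1 vs. CP1252 mismatch without a Windows-1252 install, here is a minimal sketch. It decodes the four bytes from Daniel's example (80, 9e, 9a, fc) under both encodings and compares the resulting code points. It assumes the iconv behind your R build recognises the names "CP1252" and "latin1"; the helper `decode` and the variable names are made up for illustration and are not part of any package.

```r
# Bytes of "€", "ž", "š", "ü" as stored in a Windows-1252 locale.
bytes <- c("\x80", "\x9e", "\x9a", "\xfc")

# Hypothetical helper: decode one single-byte string under the given
# source encoding and return the Unicode code point it maps to.
decode <- function(b, from) {
  utf8ToInt(iconv(b, from = from, to = "UTF-8"))
}

# Treat the bytes as what they really are (CP1252) ...
as_cp1252 <- vapply(bytes, decode, integer(1), from = "CP1252",
                    USE.NAMES = FALSE)

# ... and as what the "latin1" mark claims they are.
as_latin1 <- vapply(bytes, decode, integer(1), from = "latin1",
                    USE.NAMES = FALSE)

# Show both results as hex code points for comparison.
print(format(as.hexmode(as_cp1252)))
print(format(as.hexmode(as_latin1)))
```

Where both encoding names are recognised, the CP1252 decoding should give U+20AC, U+017E, U+0161 and U+00FC, while the latin1 decoding gives the C1 control codes U+0080, U+009E, U+009A followed by U+00FC. Those are the same escapes that print(x) and enc2utf8(x) show in Daniel's output, which is consistent with the strings being converted as if they were ISO 8859-1.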