This is a follow-up to my initial posts regarding character encodings on Windows (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and, in particular, to Patrick Perry's reply (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) (thank you for the links and the bug report!). My initial posts were quite chaotic (and partly wrong), so I am trying to clear things up a bit.

Actually, the title of my original message, "special latin1 [characters] do not print as glyphs in current devel on windows", is already wrong, because the problem concerns characters that CP1252 encodes in the 80-9F (hex) byte range. As Brian Ripley rightly pointed out, latin1 != CP1252. ISO/IEC 8859-1 (a.k.a. latin1) assigns no printable characters to the 80-9F range; see for example https://en.wikipedia.org/wiki/Windows-1252. R treats these characters as if they were latin1, however, and that is exactly the problem, IMHO.

Let me show you what I mean. (All output from R 3.5 r73238, see sessionInfo at the end)

> Sys.getlocale("LC_CTYPE")
[1] "German_Germany.1252"
> x <- c("€", "ž", "š", "ü")
> sapply(x, charToRaw)
\u0080 \u009e \u009a      ü
    80     9e     9a     fc

"€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also show the "ü" just as an example of a non-ASCII character outside that range (and because Patrick Perry used it in his bug report which might be a (slightly) different problem, but I will get to that later.)

> print(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

"€", "ž", and "š" are printed as (incorrect) Unicode escapes. "€", for example, should be \u20ac, not \u0080. (In R 3.4.1, print(x) shows the glyphs rather than the Unicode escapes. Apparently, as of R-devel 3.5, print() calls enc2utf8() or its C-level equivalent, possibly translateCharUTF8.)

> print("\u20ac")
[1] "€"

The characters in x are marked as "latin1".

> Encoding(x)
[1] "latin1" "latin1" "latin1" "latin1"

Looking at the CP1252 table (e.g. link above), we see that this is incorrect for "€", "ž", and "š", which simply do not exist in latin1.
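
To make the difference concrete without relying on any particular locale, here is a minimal sketch (an illustration only, not output from the session above) that decodes the same four bytes once as latin1 and once as CP1252 with iconv(). The variable names are made up for the example, and it assumes that the iconv implementation in use knows the encoding name "CP1252".

```r
# Bytes of "€", "ž", "š", "ü" in CP1252, built from raw values so that the
# example does not depend on the encoding of the script or the session.
bytes <- as.raw(c(0x80, 0x9e, 0x9a, 0xfc))
x <- vapply(bytes, rawToChar, character(1))

# Interpret the bytes as ISO 8859-1 (latin1).
as_latin1 <- iconv(x, from = "latin1", to = "UTF-8")
# Interpret the very same bytes as CP1252.
as_cp1252 <- iconv(x, from = "CP1252", to = "UTF-8")

# Unicode code points resulting from each interpretation.
cp_latin1 <- vapply(as_latin1, utf8ToInt, integer(1), USE.NAMES = FALSE)
cp_cp1252 <- vapply(as_cp1252, utf8ToInt, integer(1), USE.NAMES = FALSE)

print(sprintf("U+%04X", cp_latin1))
print(sprintf("U+%04X", cp_cp1252))
```

The first three code points differ between the two interpretations (C1 control characters under latin1; U+20AC, U+017E and U+0161 under CP1252), while "ü" is U+00FC in both.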

As per the documentation, "enc2utf8 convert[s] elements of character vectors to [...] UTF-8 [...], taking any marked encoding into account." Since the marked encoding is wrong, so is the output of enc2utf8().

> enc2utf8(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

Now, when we set the encoding to "unknown" everything works fine.

> x_un <- x
> Encoding(x_un) <- "unknown"
> print(x_un)
[1] "€" "ž" "š" "ü"
> (x_un2utf8 <- enc2utf8(x_un))
[1] "€" "ž" "š" "ü"

Long story short: The characters in the 80 to 9F range should not be marked as "latin1" on CP1252 locales, IMHO.
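
In the meantime, one possible workaround is to bypass the marked encoding and convert explicitly from CP1252. The following is only a sketch: cp1252_to_utf8() is a hypothetical helper, not part of R, and it assumes that the input bytes really are CP1252 and that iconv knows the name "CP1252".

```r
# Hypothetical helper (not part of R): treat the bytes of `s` as CP1252,
# regardless of any "latin1" mark, and return UTF-8 strings.
cp1252_to_utf8 <- function(s) {
  Encoding(s) <- "unknown"  # drop the (possibly wrong) "latin1" mark
  iconv(s, from = "CP1252", to = "UTF-8")
}

# Same four bytes as in x above, built from raw values.
x_raw <- vapply(as.raw(c(0x80, 0x9e, 0x9a, 0xfc)), rawToChar, character(1))
Encoding(x_raw) <- "latin1"  # mimic the marking reported by Encoding(x)

x_utf8 <- cp1252_to_utf8(x_raw)
print(Encoding(x_utf8))
print(sprintf("U+%04X", utf8ToInt(x_utf8[1])))
```

This does not address the marking itself; it only avoids relying on it for the conversion.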

As a side-note: the output of localeToCharset() is also problematic, since ISO8859-1 != CP1252.

> localeToCharset()
[1] "ISO8859-1"

Finally on to Patrick Perry's bug report (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On Windows, enc2utf8("ü") yields "|".'

Unfortunately, I cannot reproduce this with the CP1252 locale, as can be seen above. This is probably because the bug applies to the C locale (apologies if this is apparent somewhere in the bug report and I missed it).

> Sys.setlocale("LC_CTYPE", "C")
[1] "C"
> enc2utf8("ü")
[1] "|"
> charToRaw("ü")
[1] fc
> Encoding("ü")
[1] "unknown"

This does not seem to be related to the marked encoding of the string, so it seems to me that this is a different problem than the one above.

Any advice on how to proceed further would be highly appreciated.

Thanks!
Daniel

> sessionInfo()
R Under development (unstable) (2017-09-11 r73238)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.0

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel