Re: [Rd] special latin1 do not print as glyphs in current devel on windows

Daniel Possenriede Thu, 14 Sep 2017 00:41:05 -0700

This is a follow-up on my initial posts regarding character encodings on Windows (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and Patrick Perry's reply (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in particular (thank you for the links and the bug report!). My initial posts were quite chaotic (and partly wrong), so I am trying to clear things up a bit.

Actually, the title of my original message "special latin1 [characters] do not print as glyphs in current devel on windows" is already wrong, because the problem exists with characters with CP1252 encoding in the 80-9F (hex) range. As Brian Ripley rightly pointed out, latin1 != CP1252. The characters in the 80-9F code point range are not even part of ISO/IEC 8859-1 a.k.a. latin1, see for example https://en.wikipedia.org/wiki/Windows-1252. R treats them as if they were, however, and that is exactly the problem, IMHO.
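The difference can be illustrated independently of the session locale by converting raw bytes with iconv(). This is only a minimal sketch: it assumes that the iconv implementation in use accepts the encoding names "CP1252" and "latin1", and the variable names are made up for illustration.

```r
# Bytes 0x80, 0x9e, 0x9a, 0xfc: "€", "ž", "š", "ü" when read as CP1252.
bytes <- as.raw(c(0x80, 0x9e, 0x9a, 0xfc))

# One single-byte string per byte, so no source-file encoding is involved.
s <- rawToChar(bytes, multiple = TRUE)

# Read as CP1252, byte 0x80 is the euro sign, U+20AC.
as_cp1252 <- iconv(s, from = "CP1252", to = "UTF-8")

# Read as latin1, byte 0x80 becomes the control code point U+0080.
as_latin1 <- iconv(s, from = "latin1", to = "UTF-8")

# Compare the resulting code points for each of the four bytes.
cp_cp1252 <- vapply(as_cp1252, utf8ToInt, integer(1), USE.NAMES = FALSE)
cp_latin1 <- vapply(as_latin1, utf8ToInt, integer(1), USE.NAMES = FALSE)
print(rbind(CP1252 = cp_cp1252, latin1 = cp_latin1))
```

Only the last byte (0xfc, "ü") yields the same code point under both interpretations; the three bytes in the 80-9F range do not.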

Let me show you what I mean. (All output from R 3.5 r73238, see sessionInfo at the end)


> Sys.getlocale("LC_CTYPE")
[1] "German_Germany.1252"
> x <- c("€", "ž", "š", "ü")
> sapply(x, charToRaw)
\u0080 \u009e \u009a      ü
    80     9e     9a     fc

"€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also show the "ü" just as an example of a non-ASCII character outside that range (and because Patrick Perry used it in his bug report, which might be a (slightly) different problem, but I will get to that later).


> print(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

"€", "ž", and "š" are printed as (incorrect) unicode escapes. "€" for example should be \u20ac not \u0080. (In R 3.4.1, print(x) shows the glyphs and not the unicode escapes. Apparently, as of v3.5, print() calls enc2utf8() (or its equivalent in C (translateCharUTF8?))?)


> print("\u20ac")
[1] "€"

The characters in x are marked as "latin1".

> Encoding(x)
[1] "latin1" "latin1" "latin1" "latin1"

Looking at the CP1252 table (e.g. link above), we see that this is incorrect for "€", "ž", and "š", which simply do not exist in latin1.

As per the documentation, "enc2utf8 convert[s] elements of character vectors to [...] UTF-8 [...], taking any marked encoding into account." Since the marked encoding is wrong, so is the output of enc2utf8().


> enc2utf8(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

Now, when we set the encoding to "unknown" everything works fine.

> x_un <- x
> Encoding(x_un) <- "unknown"
> print(x_un)
[1] "€" "ž" "š" "ü"
> (x_un2utf8 <- enc2utf8(x_un))
[1] "€" "ž" "š" "ü"

Long story short: The characters in the 80 to 9F range should not be marked as "latin1" on CP1252 locales, IMHO.
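Until the marking changes, one possible workaround is to drop the mark and convert explicitly from CP1252 instead of relying on enc2utf8(). The following is a rough sketch under assumptions, not a tested fix: to_utf8_cp1252() is a hypothetical helper name, and it assumes that the input bytes really are CP1252 and that iconv accepts the name "CP1252".

```r
# Hypothetical helper: treat the bytes of x as CP1252, whatever mark they carry.
to_utf8_cp1252 <- function(x) {
  Encoding(x) <- "unknown"                 # drop a (possibly wrong) "latin1" mark
  iconv(x, from = "CP1252", to = "UTF-8")  # convert the raw bytes explicitly
}

# Rebuild the example strings from bytes and mark them "latin1",
# mimicking the state of x shown above.
x_demo <- rawToChar(as.raw(c(0x80, 0x9e, 0x9a, 0xfc)), multiple = TRUE)
Encoding(x_demo) <- "latin1"

x_fixed <- to_utf8_cp1252(x_demo)
Encoding(x_fixed)            # marks of the converted strings
utf8ToInt(x_fixed[1])        # code point of the first element
```

The explicit from = "CP1252" is the point of the sketch: the conversion no longer depends on how the strings happen to be marked.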

As a side note: the output of localeToCharset() is also problematic, since ISO8859-1 != CP1252.


> localeToCharset()
[1] "ISO8859-1"
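To see what R itself believes about the locale, l10n_info() can be inspected next to localeToCharset(). This is just a diagnostic sketch; that the "Latin-1" flag comes out TRUE on a CP1252 locale is my expectation, not something verified here, and the codepage element is only present on Windows.

```r
# Ask R what it assumes about the current locale.
info <- l10n_info()

# TRUE means R treats the native encoding as latin1.
is_latin1 <- info[["Latin-1"]]

# Windows-only element; NULL on other platforms.
codepage <- info$codepage

# Charset name(s) derived from the locale name.
charset <- localeToCharset()

print(list(latin1 = is_latin1, codepage = codepage, charset = charset))
```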

Finally, on to Patrick Perry's bug report (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On Windows, enc2utf8("ü") yields "|".'

Unfortunately, I cannot reproduce this with the CP1252 locale, as can be seen above. This is probably because the bug applies to the C locale (sorry if this is apparent somewhere in the bug report and I missed it).


> Sys.setlocale("LC_CTYPE", "C")
[1] "C"
> enc2utf8("ü")
[1] "|"
> charToRaw("ü")
[1] fc
> Encoding("ü")
[1] "unknown"

This does not seem to be related to the marked encoding of the string, so it seems to me that this is a different problem than the one above.


Any advice on how to proceed further would be highly appreciated.

Thanks!
Daniel

> sessionInfo()
R Under development (unstable) (2017-09-11 r73238)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0
