On 4/10/19 10:22 AM, Tomáš Bořil wrote:
> Hello,
>
> There is a long-lasting problem with processing UTF-8 source code in R
> on Windows OS. As Windows do not have "UTF-8" locale and R passes
> source code through OS before executing it, some characters are
> "simplified" by the OS before processing, leading to undesirable
> changes.
>
> Minimalistic example:
> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>> "ř"
> [1] "r"
>
> Let's assume the following script:
> # file [script.R]
> if ("ř" != "\U00159") {
>     stop("Problem: Unexpected character conversion.")
> } else {
>     cat("o.k.\n")
> }
>
> Problem:
> source("script.R", encoding = "UTF-8")
>
> OK (see
> https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding):
> eval(parse("script.R", encoding = "UTF-8"))

On my system with your example,

  > source("t.r")
  Error in eval(ei, envir) : Problem: Unexpected character conversion.
  > source("/Users/tomas/t.r", encoding="UTF-8")
  Error in eval(ei, envir) : Problem: Unexpected character conversion.
  > eval(parse("t.r", encoding="UTF-8"))
  o.k.

This is expected, unfortunately. As documented in ?source, the
"encoding" argument tells source() that the input is in UTF-8, so that
source() can convert it to the native encoding. Also as documented,
parse() uses its "encoding" argument only to mark the encoding of the
strings; it does not re-encode them, so the character strings in the
parsed result carry the encoding mark (UTF-8 in this case).

> Although the script is in UTF-8, the characters are replaced by
> "simplified" substitutes uncontrollably (depending on OS locale). The
> same goes with simply entering the code statements in R Console.
>
> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)

Yes. By default, Windows uses "best fit" when translating characters to
the native encoding. This could be changed in principle, but the change
could break existing applications that depend on it, and it would not
really help, because such characters cannot be represented in the
native encoding anyway.

You can find more in ?Encoding, but yes, it is a known problem that
users encounter frequently, and unless Windows starts supporting UTF-8
as a native encoding, there is no easy fix (a version from the Windows
10 Insider preview supports it, so maybe the situation is not
completely hopeless). In theory you can carefully read the
documentation and use only functions that can work with UTF-8 without
converting to the native encoding, but pragmatically, if you want to
work with UTF-8 files in R, it is best to use a non-Windows platform.
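
To illustrate the difference between the two calls, here is a minimal
sketch of the parse() route, assuming a throwaway script written to a
temporary file. The names script_path, script_text and env are
hypothetical and used only for this illustration. The example itself is
ASCII-only (it uses a \u escape), so it should not be affected by the
conversion it demonstrates.

```r
# Sketch: evaluate a UTF-8 script without re-encoding it to the native encoding.
# 'script_path', 'script_text' and 'env' are hypothetical names for illustration.
script_path <- tempfile(fileext = ".R")

# Build the script text with a \u escape so this example stays ASCII-only.
# The resulting string contains LATIN SMALL LETTER R WITH CARON (U+0159).
script_text <- "x <- \"\u0159\""

# useBytes = TRUE writes the bytes as they are, without translating them
# to the native encoding on the way out.
writeLines(script_text, script_path, useBytes = TRUE)

# parse() marks string constants as UTF-8 but does not re-encode them,
# so the character is not passed through the "best fit" conversion.
exprs <- parse(script_path, encoding = "UTF-8")

# Evaluate in a separate environment to keep the workspace clean.
env <- new.env()
eval(exprs, envir = env)

# Compare code points rather than printed output, because printing may
# itself go through the native encoding.
survived <- identical(utf8ToInt(env$x), 0x159L)
```

Here "survived" should be TRUE when the character reached R unchanged.
With source(script_path, encoding = "UTF-8") in a non-UTF-8 Windows
locale, the same check would be expected to fail, because source()
converts the input to the native encoding first.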

Best
Tomas

>
> Best regards
> Tomas Boril
>
>> R.version
>                _
> platform       x86_64-w64-mingw32
> arch           x86_64
> os             mingw32
> system         x86_64, mingw32
> status         alpha
> major          3
> minor          6.0
> year           2019
> month          04
> day            07
> svn rev        76333
> language       R
> version.string R version 3.6.0 alpha (2019-04-07 r76333)
> nickname
>
>> Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel