PS: I will have a go at using your experimental version to see whether it could help us out. If I run into trouble, I will mail you personally.
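For reference, the package-free reproduction suggested in the quoted thread below (just calls to parse with the encoding set, plus the current encoding) could look roughly like the following. This is only a sketch: the temporary file and the fresh environment are my own choices for illustration, and what it prints depends on the platform and locale it runs in, which is the point of running it.

```r
# Write a one-line R script whose string literal contains the character
# o-circumflex as the two UTF-8 bytes c3 b4, independent of this session's encoding.
src <- tempfile(fileext = ".R")
writeBin(
  c(charToRaw('mathotString <- "Math'), as.raw(c(0xc3, 0xb4)), charToRaw('t!"\n')),
  src
)

# Parse with the encoding declared, then evaluate in a fresh environment.
exprs <- parse(src, encoding = "UTF-8", keep.source = FALSE)
env <- new.env()
eval(exprs[[1]], envir = env)

# Report the current encoding and how the literal ended up in memory.
print(l10n_info())
print(Encoding(env$mathotString))
print(charToRaw(env$mathotString))
```

Comparing the printed bytes with the bytes written to the file shows whether the literal was converted while parsing, before any package loading is involved.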
On Thu, 17 Dec 2020 at 17:17, jo...@jorisgoosen.nl <jo...@jorisgoosen.nl> wrote:

> On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
>> On 12/16/20 11:07 PM, jo...@jorisgoosen.nl wrote:
>> > David,
>> >
>> > Thanks for the response!
>> >
>> > So the problem is a bit worse than just setting `encoding="UTF-8"` on functions like readLines.
>> > I'll describe our setup a bit:
>> > We run R embedded in a separate executable and, through a whole bunch of C(++) magic, get that to the main executable that runs the actual interface. All the code that isn't R basically uses UTF-8. This works well: we've made sure that all of our source code is encoded properly, and I've verified that, for this particular problem at least, my source file is definitely encoded in UTF-8 (I've checked a hexdump).
>> >
>> > The simplest solution, which we initially took, to get R+Windows to cooperate with everything is to simply set the locale to "C" before starting R. That way R simply assumes UTF-8 is native and everything worked splendidly. Until, of course, a file needs to be opened in R that contains some non-ASCII characters. I noticed the problem because a Korean user had Hangul in his username and that broke everything, because R was trying to convert to a different locale than Windows was using.
>>
>> Setting the locale to "C" does not make R assume UTF-8 is the native encoding; there is no way to make UTF-8 the current native encoding in R on the current builds of R on Windows. This is an old limitation of Windows, only recently fixed by Microsoft in recent Windows 10 and with the UCRT Windows runtime (see my blog post [1] for more - to make R support this we need a new toolchain to build R).
>>
>> If you set the locale to C encoding, you are telling R the native encoding is C/POSIX (essentially ASCII), not UTF-8.
>> Encoding-sensitive operations, including conversions, including those conversions that happen without user control (e.g. for interacting with Windows), will produce incorrect results (garbage) or, in the better case, errors, warnings, or omitted, substituted or transliterated characters.
>>
>> In principle, setting the encoding via locale is dangerous on Windows, because Windows has two current encodings, not just one. By setting the locale you set the one used in the C runtime, but not the other one used by the system calls. If all code (in R, packages, external libraries) was perfect, this would still work as long as all strings used were representable in both encodings. For other strings it won't work, and then code is not perfect in this regard: it is usually written assuming there is one current encoding, which common sense dictates should be the case. With the recent UTF-8 support ([1]), one can switch both of these to UTF-8.
>
> Well, this is exactly why I want to get rid of the situation. But this messes up the output because everything else expects UTF-8, which is why I'm looking for some kind of solution.
>
>> > The solution I've now been working on is:
>> > I took the source code of R 4.0.3 and changed the backend of "gettext" to add an `encoding="something something"` option, plus a bit of extra stuff like `bind_textdomain_codeset` in case I need to tweak the codeset/charset that gettext uses.
>> > I think I've got that working properly now, and once I solve the problem of the encoding in a pkg I will open a bug report/feature request and add a patch that implements it.
>>
>> A number of similar "shortcuts" have been added to R in the past, but they make the code more complex, harder to maintain and use, and can't realistically solve all of these problems anyway. Strings will eventually be assumed to be in what is the current native encoding by the C library.
>> The same assumption will be made in R, in any external code R uses, and in code R packages use. Now that Microsoft finally is supporting UTF-8, the way to get out of this is switching to UTF-8. This needs only small changes to R source code compared to those "shortcuts" (or to using UTF-16LE). I'd be against polluting the code with any more "shortcuts".
>
> I think the addition of "bind_textdomain_codeset" is not strictly necessary and can be left out, because I think setting the environment variable "OUTPUT_CHARSET=UTF-8" gives the same result for us.
> The addition of the "encoding" option to the internal "do_gettext" is just a few lines of code, and I also undid some duplication between do_gettext and do_ngettext, which should make it easier to maintain. But all of that is moot if there is no way to keep the literal strings from sources in UTF-8 anyhow.
>
> Before starting on this I did actually read your blog post about UTF-8 several times and it seems like the best way forward. Not to mention it would make my life easier and me happier when I can stop worrying about Windows/DOS codepages!
> Thank you for your work on it indeed!
>
> But my problem with that is that a number of people still use an older version of Windows, and your solution won't work there. That would mean that we either drop support for them or they would have to live with weird-looking translations. Or I have to go back to the suboptimal solution of the "C" locale, which I really do want to avoid because, as you said, it breaks other stuff in unpredictable ways.
>
>> > The problem I'm stuck with now is simply this:
>> > I have an R pkg here that I want to test the translations with, and the code is definitely saved as UTF-8, the package has "Encoding: UTF-8" in the DESCRIPTION and it all loads and works.
>> > The particular problem I have is that the R code contains literally: `mathotString <- "Mathôt!"`
>> > The actual file contains ô as proper UTF-8, the bytes "0xC3 0xB4", but R turns it into "0xf4".
>> > Seemingly this happens on loading the package, because I haven't done anything with it except put it in my debug C function to print its contents as hexadecimals...
>> >
>> > The only thing I want to achieve here is that when R loads the package it keeps those strings in their original UTF-8 encoding, without converting them to "native" or to the strange Unicode code point it seemingly placed in there instead. Otherwise I cannot get gettext to work fully in UTF-8 mode.
>> >
>> > Is this already possible in R?
>>
>> In principle, working with strings not representable in the current encoding is not reliable (and never will be). It can still work in some specific cases and uses. Parsing a UTF-8 string literal from a file, with correctly declared encoding as documented in WRE (Writing R Extensions), should work at least in single-byte encodings. But what happens after that string is parsed is another thing. The parsing is based internally on using these "shortcuts", that is, lying to a part of the parser about the encoding, and telling the rest of the parser that it is really something else (not native, but UTF-8).
>
> So the reason the string literals are turned into the local encoding is because setting the "Encoding" on a package is essentially a hack?
>
>> The part that is being "lied to" may get confused or not. It would not when the real native encoding is, say, latin1, a common case in the past for which the hack was created, but it might when it is a double-byte encoding that conflicts with the text being parsed in dangerous ways.
>> This is also why this hack only makes sense for string literals (and comments), and still only to a limit, as the strings may be misinterpreted later after parsing.
>
> Well, our case is entirely limited to string literals that are presented to the user through an all-UTF-8 interface.
> So I would assume none of the edge cases would come into play.
> Any system paths and things like that would still be in the local encoding.
>
>> So a really short summary is: you can only reliably use strings representable in the current encoding in R, and that encoding cannot be UTF-8 on Windows in released versions of R. There is an experimental version, see [1]; if you could experiment with that and see whether that might work for your applications, and try to find and report bugs there (e.g. to me directly), that would be useful.
>
> So when I read in certain R documentation that a string can have a "UTF-8" encoding in R, this is not true?
> As in, when I read documentation such as https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html it really seems to indicate to me that UTF-8 is in fact supported in R on Windows.
> My assumption was that R uses `translateChar` internally to make sure it is in the right encoding before interfacing with the OS and other places where this might matter.
>
>> If you find behavior re encodings in released versions of R that contradicts the current documentation, please report with a minimal reproducible example; such cases should be fixed (even though sometimes the "fix" would be just changing the documentation, the effort really should be now for supporting UTF-8 for real). Specifically with "mathotString", you might try creating an example that does not include any package (just calls to parse with encoding options set), only then gradually adding more of package loading if that does not reproduce.
>> It would be important to know the current encoding (sessionInfo, l10n_info).
>
> Well, the reason I mailed the mailing list was because I couldn't for the life of me find any documentation that told me anything in particular about how literal strings are supposed to be stored in memory. But it just seems logical to me that if R already supports parsing and loading a package encoded with UTF-8, and it supports having UTF-8 strings in memory next to strings in native encoding, the most straightforward way of loading these literal strings would be in UTF-8.
>
> I would love to use the new version of R that supports properly interfacing with Windows 10.
> And given that the only other supported version of Windows is 8.1 and barely anyone uses it, it might be worth dropping support for that.
> I just hoped I could find a workable solution without such a step.
>
> Cheers,
> Joris
>
>> Best,
>> Tomas
>>
>> [1] https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
>>
>> > Cheers,
>> > Joris
>> >
>> > On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosa...@gmail.com> wrote:
>> >
>> >> Joris:
>> >>
>> >> I’ve fought with encoding problems on Windows a lot. Here are some general suggestions.
>> >>
>> >> 1. Put “@encoding UTF-8” on any Roxygen comments.
>> >> 2. Put “encoding = "UTF-8"” on any functions like writeLines or readLines that read/write to a text file.
>> >> 3. This post: https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
>> >>
>> >> If you have a more specific problem, please describe and we can try to help.
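As a rough illustration of suggestion 2 in the list above, here is a minimal sketch of a UTF-8 round trip through a text file. Two things in it are my own assumptions rather than part of the suggestion: the example string and temporary file are made up, and since base R's `writeLines` has no `encoding` argument (only `readLines` does), the write side uses `enc2utf8` together with `useBytes = TRUE` instead.

```r
# Hypothetical example text, built from raw bytes so that it is UTF-8
# regardless of the encoding of this session ("Math", c3 b4, "t!").
utf8_bytes <- as.raw(c(0x4d, 0x61, 0x74, 0x68, 0xc3, 0xb4, 0x74, 0x21))
x <- rawToChar(utf8_bytes)
Encoding(x) <- "UTF-8"

path <- tempfile(fileext = ".txt")

# Write the UTF-8 bytes as they are, without re-encoding to the native encoding.
con <- file(path, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the line back and declare that the file content is UTF-8.
y <- readLines(path, encoding = "UTF-8")

print(Encoding(y))
print(charToRaw(y))
```

Note that `encoding = "UTF-8"` on `readLines` only marks the strings it returns as UTF-8; it does not re-encode the input.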
>> >> David
>> >>
>> >> *From: *jo...@jorisgoosen.nl
>> >> *Sent: *Wednesday, December 16, 2020 1:52 PM
>> >> *To: *r-package-devel@r-project.org
>> >> *Subject: *[R-pkg-devel] Package Encoding and Literal Strings
>> >>
>> >> Hello All,
>> >>
>> >> Some context: I am one of the programmers of a software pkg (https://jasp-stats.org/) that uses an embedded instance of R to do statistics, and makes that a bit easier for people who are intimidated by R or like to have something more GUI oriented.
>> >>
>> >> We have been working on translating the interface but ran into several problems related to the encoding of strings. We prefer to use UTF-8 for everything, and this works wonderfully on Unix systems, as is to be expected.
>> >>
>> >> Windows, however, is a different matter. Currently I am working on some local changes to "do_gettext" and some related internal functions of R to be able to get UTF-8 encoded output from there.
>> >>
>> >> But I ran into a bit of a problem and I think this mailing list is probably the best place to start.
>> >>
>> >> It seems that if I have an R package that specifies "Encoding: UTF-8" in DESCRIPTION, the literal strings inside the package are converted to the local codeset/codepage regardless of what I want.
>> >>
>> >> Is it possible to keep the strings in UTF-8 internally in such a pkg somehow?
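To make the byte-level difference behind this question concrete, here is a small sketch that converts the same text between UTF-8 and latin1 explicitly. It only illustrates the two representations discussed in this thread ("0xC3 0xB4" versus "0xf4"); it does not reproduce what package loading does, and the variable names are made up for the example.

```r
# "Mathôt!" as UTF-8 bytes: ô is the two bytes c3 b4.
utf8_bytes <- as.raw(c(0x4d, 0x61, 0x74, 0x68, 0xc3, 0xb4, 0x74, 0x21))
x <- rawToChar(utf8_bytes)
Encoding(x) <- "UTF-8"

# Convert to latin1: ô becomes the single byte f4 and the result is marked latin1.
y <- iconv(x, from = "UTF-8", to = "latin1")
print(Encoding(y))
print(charToRaw(y))

# Convert back: enc2utf8 uses the latin1 mark to restore the original UTF-8 bytes.
z <- enc2utf8(y)
print(charToRaw(z))
```

The single byte f4 is also the Unicode code point of ô (U+00F4), which would explain why the converted literal looks like a "strange code point" in a hex dump.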
>> >> Best regards,
>> >> Joris Goosen
>> >> University of Amsterdam
>> >>
>> >> ______________________________________________
>> >> R-package-devel@r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-package-devel