On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinus...@gmail.com> wrote:
> I suspect your UTF-8 string is being stripped of its encoding before
> write, and so assumed to be in the system native encoding, and then
> re-encoded as UTF-8 when written to the file. You can see something
> similar with:
>
>     > tmp <- 'é'
>     > tmp <- iconv(tmp, to = 'UTF-8')
>     > Encoding(tmp) <- "unknown"
>     > charToRaw(iconv(tmp, to = "UTF-8"))
>     [1] c3 83 c2 a9
>
> It's worth saying that:
>
>     file(..., encoding = "UTF-8")
>
> means "attempt to re-encode strings as UTF-8 when writing to this
> file". However, if you already know your text is UTF-8, then you
> likely want to avoid opening a connection that might attempt to
> re-encode the input. Conversely (assuming I'm understanding the
> documentation correctly)
>
>     file(..., encoding = "native.enc")
>
> means "assume that strings are in the native encoding, and hence
> translation is unnecessary". Note that it does not mean "attempt to
> translate strings to the native encoding".
If all that is true I think ?file needs some attention. I've read it
several times now and I just don't see how it can be interpreted as
you've described it.

Best,
Ista

> Also note that writeLines(..., useBytes = FALSE) will explicitly
> translate to the current encoding before sending bytes to the
> requested connection. In other words, there are two locations where
> translation might occur in your example:
>
>    1) In the call to writeLines(),
>    2) When characters are passed to the connection.
>
> In your case, it sounds like translation should be suppressed at both steps.
>
> I think this is documented correctly in ?writeLines (and also the
> Encoding section of ?file), but the behavior may feel unfamiliar at
> first glance.
>
> Kevin
>
> On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <dav...@live.com> wrote:
>>
>> I think this behavior is inconsistent with the documentation:
>>
>>     tmp <- 'é'
>>     tmp <- iconv(tmp, to = 'UTF-8')
>>     print(Encoding(tmp))
>>     print(charToRaw(tmp))
>>     tmpfilepath <- tempfile()
>>     writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)
>>
>> [1] "UTF-8"
>> [1] c3 a9
>>
>> Raw text as hex: c3 83 c2 a9
>>
>> If I switch to useBytes = FALSE, then the variable is written correctly as
>> c3 a9.
>>
>> Any thoughts? This behavior is related to this issue:
>> https://github.com/yihui/knitr/issues/1509
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
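
For anyone following the thread, here is a minimal sketch of the two points Kevin raises: how UTF-8 bytes get double-encoded when they are misread as a single-byte encoding, and one way to suppress translation at both steps. It is only an illustration under stated assumptions, not a statement of what ?file guarantees. Latin-1 stands in for a non-UTF-8 native encoding, the connection is opened without an `encoding` argument (so it relies on the default `getOption("encoding")` being "native.enc"), and the names `x`, `doubled`, `path`, `con` and `written` are made up for the example.

```r
# Build "é" from an escape so the example does not depend on how this
# script itself is encoded; the parser marks the string as UTF-8.
x <- "\u00e9"
charToRaw(x)          # c3 a9

# Assumption: Latin-1 plays the role of the native encoding.
# Treating the two UTF-8 bytes as Latin-1 characters and converting them
# to UTF-8 reproduces the doubled bytes reported in the thread.
doubled <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(doubled)    # c3 83 c2 a9

# Suppress translation at both steps:
#   1) useBytes = TRUE stops writeLines() from translating the string;
#   2) no `encoding` argument, so the connection is not asked to re-encode.
# Binary mode ("wb") keeps the line ending as a single \n on every platform.
path <- tempfile()
con <- file(path, open = "wb")
writeLines(x, con, useBytes = TRUE)
close(con)

written <- readBin(path, what = "raw", n = 16)
written               # c3 a9 0a
```

If the file should be UTF-8 and the strings are already known to be UTF-8, this pattern avoids depending on how the connection interprets unmarked strings.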