Of course, right after writing this e-mail I tested on my Windows machine and did not see what I expected:
    > charToRaw(before)
    [1] c3 a9
    > charToRaw(after)
    [1] e9

so obviously I'm misunderstanding something as well.

Best,
Kevin

On Sat, Feb 17, 2018 at 2:19 PM, Kevin Ushey <kevinus...@gmail.com> wrote:
> From my understanding, translation is implied in this line of ?file (from the
> Encoding section):
>
>     The encoding of the input/output stream of a connection can be specified
>     by name in the same way as it would be given to iconv: see that help page
>     for how to find out what encoding names are recognized on your platform.
>     Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is
>     the internal encoding of the current locale and hence no translation is
>     done.
>
> This is also hinted at in the documentation in ?readLines for its 'encoding'
> argument, which has a different semantic meaning from the 'encoding' argument
> as used with R connections:
>
>     encoding to be assumed for input strings. It is used to mark character
>     strings as known to be in Latin-1 or UTF-8: it is not used to re-encode
>     the input. To do the latter, specify the encoding as part of the
>     connection con or via options(encoding=): see the examples.
>
> It might be useful to augment the documentation in ?file with something like:
>
>     The 'encoding' argument is used to request the translation of strings when
>     writing to a connection.
>
> and, perhaps to further drive home the point about not translating when
> encoding = "native.enc":
>
>     Note that R will not attempt translation of strings when encoding is
>     either "" or "native.enc" (the default, as per getOption("encoding")).
>     This implies that attempting to write, for example, UTF-8 encoded content
>     to a connection opened using "native.enc" will retain its original UTF-8
>     encoding -- it will not be translated.
>
> It is a bit surprising that 'native.enc' means "do not translate" rather than
> "attempt translation to the encoding associated with the current locale", but
> those are the semantics and they are not bound to change.
>
> This is the code I used to convince myself of that case:
>
>     conn <- file(tempfile(), encoding = "native.enc", open = "w+")
>
>     before <- iconv('é', to = "UTF-8")
>     cat(before, file = conn, sep = "\n")
>     after <- readLines(conn)
>
>     charToRaw(before)
>     charToRaw(after)
>
> with output:
>
>     > charToRaw(before)
>     [1] c3 a9
>     > charToRaw(after)
>     [1] c3 a9
>
> Best,
> Kevin
>
>
> On Thu, Feb 15, 2018 at 9:16 AM, Ista Zahn <istaz...@gmail.com> wrote:
>> On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinus...@gmail.com> wrote:
>>> I suspect your UTF-8 string is being stripped of its encoding before
>>> write, and so assumed to be in the system native encoding, and then
>>> re-encoded as UTF-8 when written to the file. You can see something
>>> similar with:
>>>
>>>     > tmp <- 'é'
>>>     > tmp <- iconv(tmp, to = 'UTF-8')
>>>     > Encoding(tmp) <- "unknown"
>>>     > charToRaw(iconv(tmp, to = "UTF-8"))
>>>     [1] c3 83 c2 a9
>>>
>>> It's worth saying that:
>>>
>>>     file(..., encoding = "UTF-8")
>>>
>>> means "attempt to re-encode strings as UTF-8 when writing to this
>>> file". However, if you already know your text is UTF-8, then you
>>> likely want to avoid opening a connection that might attempt to
>>> re-encode the input. Conversely (assuming I'm understanding the
>>> documentation correctly)
>>>
>>>     file(..., encoding = "native.enc")
>>>
>>> means "assume that strings are in the native encoding, and hence
>>> translation is unnecessary". Note that it does not mean "attempt to
>>> translate strings to the native encoding".
>>
>> If all that is true I think ?file needs some attention. I've read it
>> several times now and I just don't see how it can be interpreted as
>> you've described it.
>>
>> Best,
>> Ista
>>
>>>
>>> Also note that writeLines(..., useBytes = FALSE) will explicitly
>>> translate to the current encoding before sending bytes to the
>>> requested connection. In other words, there are two locations where
>>> translation might occur in your example:
>>>
>>>     1) In the call to writeLines(),
>>>     2) When characters are passed to the connection.
>>>
>>> In your case, it sounds like translation should be suppressed at both steps.
>>>
>>> I think this is documented correctly in ?writeLines (and also the
>>> Encoding section of ?file), but the behavior may feel unfamiliar at
>>> first glance.
>>>
>>> Kevin
>>>
>>> On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <dav...@live.com> wrote:
>>>>
>>>> I think this behavior is inconsistent with the documentation:
>>>>
>>>> tmp <- 'é'
>>>> tmp <- iconv(tmp, to = 'UTF-8')
>>>> print(Encoding(tmp))
>>>> print(charToRaw(tmp))
>>>> tmpfilepath <- tempfile()
>>>> writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)
>>>>
>>>> [1] "UTF-8"
>>>> [1] c3 a9
>>>>
>>>> Raw text as hex: c3 83 c2 a9
>>>>
>>>> If I switch to useBytes = FALSE, then the variable is written correctly as
>>>> c3 a9.
>>>>
>>>> Any thoughts? This behavior is related to this issue:
>>>> https://github.com/yihui/knitr/issues/1509

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel