I think it is as Kevin described in an earlier response: the garbled output arises because a UTF-8 encoded string is assumed to be in the native encoding (which happens not to be UTF-8 on the platform where this is observed) and is then converted to UTF-8 a second time.

I think the documentation is consistent with the observed behavior:

   tmp <- 'é'
   tmp <- iconv(tmp, to = 'UTF-8')
   print(Encoding(tmp))
   print(charToRaw(tmp))
   tmpfilepath <- tempfile()
   writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)

[1] "UTF-8"
[1] c3 a9

Raw text as hex: c3 83 c2 a9

useBytes=TRUE in writeLines means that the UTF-8 string will be passed byte-by-byte to the connection. encoding="UTF-8" tells the connection to convert the bytes to UTF-8 (from the native encoding). So the second step converts a string that is assumed to be in the native encoding but is in fact already in UTF-8.
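
The double conversion can be reproduced on any platform by working on raw bytes. This is only a sketch: it assumes the native encoding is latin1 (as it might be on a Western European Windows setup), and the variable names are made up for illustration.

```r
# UTF-8 bytes of 'é', built explicitly so the result does not depend on the locale.
utf8_bytes <- as.raw(c(0xc3, 0xa9))

# Assumption: the native encoding is latin1. Treat the UTF-8 bytes as if they
# were latin1 and convert them to UTF-8 again, as the connection would.
twice <- iconv(list(utf8_bytes), from = "latin1", to = "UTF-8", toRaw = TRUE)[[1]]

print(twice)  # c3 83 c2 a9, the garbled bytes found in the file
```

Each of the two original bytes is read as a separate latin1 character and expanded to two UTF-8 bytes, which gives the four bytes reported above.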

The documentation describes "useBytes=TRUE" as for expert use only. It can be useful for avoiding unnecessary conversions in some special cases, but one then has to make sure that no further conversions are attempted (for instance, by using "" as the encoding in "file"). In short, the advice is not to use useBytes=TRUE with writeLines, but to rely on the default behavior.
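
A minimal sketch of the two consistent combinations follows. The file paths and variable names are illustrative only, and the second variant assumes a locale whose native encoding can represent 'é'.

```r
# Build a UTF-8 marked 'é' from explicit bytes, independent of the locale.
x <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(x) <- "UTF-8"

# Variant 1 (expert use): pass the bytes through untouched and make sure the
# connection does not convert them either, by using the native encoding.
path_bytes <- tempfile()
con <- file(path_bytes, open = "w", encoding = "native.enc")
writeLines(x, con, useBytes = TRUE)
close(con)
written_bytes <- readBin(path_bytes, what = "raw", n = 2L)
print(written_bytes)  # c3 a9

# Variant 2 (default behavior): let writeLines() translate to the native
# encoding and let the connection convert from native to UTF-8.
# Assumption: the native encoding can represent 'é'.
path_default <- tempfile()
con <- file(path_default, open = "w", encoding = "UTF-8")
writeLines(x, con)
close(con)
```

Opening and closing the connection explicitly also avoids leaving an unused connection behind, which can happen when file() is created inline in the writeLines() call.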

Tomas


On 02/17/2018 11:24 PM, Kevin Ushey wrote:
Of course, right after writing this e-mail I tested on my Windows
machine and did not see what I expected:

charToRaw(before)
[1] c3 a9
charToRaw(after)
[1] e9

so obviously I'm misunderstanding something as well.

Best,
Kevin

On Sat, Feb 17, 2018 at 2:19 PM, Kevin Ushey <kevinus...@gmail.com> wrote:
 From my understanding, translation is implied in this line of ?file (from the
Encoding section):

     The encoding of the input/output stream of a connection can be specified
     by name in the same way as it would be given to iconv: see that help page
     for how to find out what encoding names are recognized on your platform.
     Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is
     the internal encoding of the current locale and hence no translation is
     done.

This is also hinted at in the documentation in ?readLines for its 'encoding'
argument, which has a different semantic meaning from the 'encoding' argument
as used with R connections:

     encoding to be assumed for input strings. It is used to mark character
     strings as known to be in Latin-1 or UTF-8: it is not used to re-encode
     the input. To do the latter, specify the encoding as part of the
     connection con or via options(encoding=): see the examples.

It might be useful to augment the documentation in ?file with something like:

     The 'encoding' argument is used to request the translation of strings when
     writing to a connection.

and, perhaps to further drive home the point about not translating when
encoding = "native.enc":

     Note that R will not attempt translation of strings when encoding is
     either "" or "native.enc" (the default, as per getOption("encoding")).
     This implies that attempting to write, for example, UTF-8 encoded content
     to a connection opened using "native.enc" will retain its original UTF-8
     encoding -- it will not be translated.

It is a bit surprising that 'native.enc' means "do not translate" rather than
"attempt translation to the encoding associated with the current locale", but
those are the semantics and they are unlikely to change.

This is the code I used to convince myself of that case:

     conn <- file(tempfile(), encoding = "native.enc", open = "w+")

     before <- iconv('é', to = "UTF-8")
     cat(before, file = conn, sep = "\n")
     after <- readLines(conn)

     charToRaw(before)
     charToRaw(after)

with output:

     > charToRaw(before)
     [1] c3 a9
     > charToRaw(after)
     [1] c3 a9

Best,
Kevin


On Thu, Feb 15, 2018 at 9:16 AM, Ista Zahn <istaz...@gmail.com> wrote:
On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinus...@gmail.com> wrote:
I suspect your UTF-8 string is being stripped of its encoding before
write, and so assumed to be in the system native encoding, and then
re-encoded as UTF-8 when written to the file. You can see something
similar with:

     > tmp <- 'é'
     > tmp <- iconv(tmp, to = 'UTF-8')
     > Encoding(tmp) <- "unknown"
     > charToRaw(iconv(tmp, to = "UTF-8"))
     [1] c3 83 c2 a9

It's worth saying that:

     file(..., encoding = "UTF-8")

means "attempt to re-encode strings as UTF-8 when writing to this
file". However, if you already know your text is UTF-8, then you
likely want to avoid opening a connection that might attempt to
re-encode the input. Conversely (assuming I'm understanding the
documentation correctly)

     file(..., encoding = "native.enc")

means "assume that strings are in the native encoding, and hence
translation is unnecessary". Note that it does not mean "attempt to
translate strings to the native encoding".
If all that is true I think ?file needs some attention. I've read it
several times now and I just don't see how it can be interpreted as
you've described it.

Best,
Ista

Also note that writeLines(..., useBytes = FALSE) will explicitly
translate to the current encoding before sending bytes to the
requested connection. In other words, there are two locations where
translation might occur in your example:

    1) In the call to writeLines(),
    2) When characters are passed to the connection.

In your case, it sounds like translation should be suppressed at both steps.
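
The two translation steps can be modelled with iconv on raw bytes. This is a hypothetical model for illustration, not R's actual implementation: simulate_write is a made-up helper, and the native encoding is assumed to be latin1.

```r
# Hypothetical two-stage model of writeLines() followed by a connection.
# Assumption: the native encoding is latin1 (the 'native' argument).
simulate_write <- function(utf8_bytes, useBytes, conn_encoding, native = "latin1") {
  conv <- function(b, from, to) {
    iconv(list(b), from = from, to = to, toRaw = TRUE)[[1]]
  }
  # Stage 1: writeLines() translates to the native encoding unless useBytes = TRUE.
  stage1 <- if (useBytes) utf8_bytes else conv(utf8_bytes, "UTF-8", native)
  # Stage 2: the connection converts from native to its own encoding,
  # unless that encoding is "" or "native.enc" (no translation).
  if (conn_encoding %in% c("", "native.enc")) {
    stage1
  } else {
    conv(stage1, native, conn_encoding)
  }
}

e_acute <- as.raw(c(0xc3, 0xa9))  # UTF-8 bytes of 'é'

res_bytes_utf8   <- simulate_write(e_acute, useBytes = TRUE,  conn_encoding = "UTF-8")
res_trans_utf8   <- simulate_write(e_acute, useBytes = FALSE, conn_encoding = "UTF-8")
res_bytes_native <- simulate_write(e_acute, useBytes = TRUE,  conn_encoding = "native.enc")
res_trans_native <- simulate_write(e_acute, useBytes = FALSE, conn_encoding = "native.enc")

print(res_bytes_utf8)    # c3 83 c2 a9 (double-encoded)
print(res_trans_utf8)    # c3 a9
print(res_bytes_native)  # c3 a9
print(res_trans_native)  # e9 (latin1)
```

Under this model, UTF-8 output results either from translating at both steps or from suppressing translation at both steps; mixing the two gives the garbled or latin1 output.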

I think this is documented correctly in ?writeLines (and also the
Encoding section of ?file), but the behavior may feel unfamiliar at
first glance.

Kevin

On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <dav...@live.com> wrote:
I think this behavior is inconsistent with the documentation:

   tmp <- 'é'
   tmp <- iconv(tmp, to = 'UTF-8')
   print(Encoding(tmp))
   print(charToRaw(tmp))
   tmpfilepath <- tempfile()
   writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)

[1] "UTF-8"
[1] c3 a9

Raw text as hex: c3 83 c2 a9

If I switch to useBytes = FALSE, then the variable is written correctly as c3 a9.

Any thoughts? This behavior is related to this issue: 
https://github.com/yihui/knitr/issues/1509



______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel