Re: [Rd] writeLines argument useBytes = TRUE still making conversions

Kevin Ushey Sat, 17 Feb 2018 14:25:18 -0800

Of course, right after writing this e-mail I tested on my Windows
machine and did not see what I expected:


> charToRaw(before)
[1] c3 a9
> charToRaw(after)
[1] e9

so obviously I'm misunderstanding something as well.
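
One way to take both cat() and readLines() out of the picture is to look
at the bytes on disk directly. The following is only a minimal sketch, not
something verified on the Windows machine above. It assumes that a
connection opened in binary mode ("wb") with the default encoding, combined
with useBytes = TRUE, leaves the string's bytes alone; the variable names
path and on_disk are made up for illustration.

```r
# Build the two-byte UTF-8 encoding of 'é' from raw bytes, so the example
# does not depend on how this script itself is encoded.
before <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(before) <- "UTF-8"

path <- tempfile()

# Assumption: a binary-mode connection with the default encoding does not
# re-encode, and useBytes = TRUE stops writeLines() from translating.
conn <- file(path, open = "wb")
writeLines(before, conn, useBytes = TRUE)
close(conn)

# Read the file back as raw bytes instead of using readLines(), whose
# result depends on how the bytes are interpreted in the current locale.
on_disk <- readBin(path, what = "raw", n = file.size(path))
print(on_disk)
```

If neither step translated anything, on_disk should be c3 a9 followed by
the newline byte 0a. Seeing e9 there instead would point at a translation
on the writing side rather than in readLines().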

Best,
Kevin

On Sat, Feb 17, 2018 at 2:19 PM, Kevin Ushey <kevinus...@gmail.com> wrote:
> From my understanding, translation is implied in this line of ?file (from the
> Encoding section):
>
>     The encoding of the input/output stream of a connection can be specified
>     by name in the same way as it would be given to iconv: see that help page
>     for how to find out what encoding names are recognized on your platform.
>     Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is
>     the internal encoding of the current locale and hence no translation is
>     done.
>
> This is also hinted at in the documentation in ?readLines for its 'encoding'
> argument, which has a different semantic meaning from the 'encoding' argument
> as used with R connections:
>
>     encoding to be assumed for input strings. It is used to mark character
>     strings as known to be in Latin-1 or UTF-8: it is not used to re-encode
>     the input. To do the latter, specify the encoding as part of the
>     connection con or via options(encoding=): see the examples.
>
> It might be useful to augment the documentation in ?file with something like:
>
>     The 'encoding' argument is used to request the translation of strings when
>     writing to a connection.
>
> and, perhaps to further drive home the point about not translating when
> encoding = "native.enc":
>
>     Note that R will not attempt translation of strings when encoding is
>     either "" or "native.enc" (the default, as per getOption("encoding")).
>     This implies that attempting to write, for example, UTF-8 encoded content
>     to a connection opened using "native.enc" will retain its original UTF-8
>     encoding -- it will not be translated.
>
> It is a bit surprising that 'native.enc' means "do not translate" rather than
> "attempt translation to the encoding associated with the current locale", but
> those are the semantics and they are unlikely to change.
>
> This is the code I used to convince myself of that case:
>
>     conn <- file(tempfile(), encoding = "native.enc", open = "w+")
>
>     before <- iconv('é', to = "UTF-8")
>     cat(before, file = conn, sep = "\n")
>     after <- readLines(conn)
>
>     charToRaw(before)
>     charToRaw(after)
>
> with output:
>
>     > charToRaw(before)
>     [1] c3 a9
>     > charToRaw(after)
>     [1] c3 a9
>
> Best,
> Kevin
>
>
> On Thu, Feb 15, 2018 at 9:16 AM, Ista Zahn <istaz...@gmail.com> wrote:
>> On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinus...@gmail.com> wrote:
>>> I suspect your UTF-8 string is being stripped of its encoding before
>>> write, and so assumed to be in the system native encoding, and then
>>> re-encoded as UTF-8 when written to the file. You can see something
>>> similar with:
>>>
>>>     > tmp <- 'é'
>>>     > tmp <- iconv(tmp, to = 'UTF-8')
>>>     > Encoding(tmp) <- "unknown"
>>>     > charToRaw(iconv(tmp, to = "UTF-8"))
>>>     [1] c3 83 c2 a9
>>>
>>> It's worth saying that:
>>>
>>>     file(..., encoding = "UTF-8")
>>>
>>> means "attempt to re-encode strings as UTF-8 when writing to this
>>> file". However, if you already know your text is UTF-8, then you
>>> likely want to avoid opening a connection that might attempt to
>>> re-encode the input. Conversely (assuming I'm understanding the
>>> documentation correctly)
>>>
>>>     file(..., encoding = "native.enc")
>>>
>>> means "assume that strings are in the native encoding, and hence
>>> translation is unnecessary". Note that it does not mean "attempt to
>>> translate strings to the native encoding".
>>
>> If all that is true I think ?file needs some attention. I've read it
>> several times now and I just don't see how it can be interpreted as
>> you've described it.
>>
>> Best,
>> Ista
>>
>>>
>>> Also note that writeLines(..., useBytes = FALSE) will explicitly
>>> translate to the current encoding before sending bytes to the
>>> requested connection. In other words, there are two locations where
>>> translation might occur in your example:
>>>
>>>    1) In the call to writeLines(),
>>>    2) When characters are passed to the connection.
>>>
>>> In your case, it sounds like translation should be suppressed at both steps.
>>>
>>> I think this is documented correctly in ?writeLines (and also the
>>> Encoding section of ?file), but the behavior may feel unfamiliar at
>>> first glance.
>>>
>>> Kevin
>>>
>>> On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <dav...@live.com> wrote:
>>>>
>>>> I think this behavior is inconsistent with the documentation:
>>>>
>>>>   tmp <- 'é'
>>>>   tmp <- iconv(tmp, to = 'UTF-8')
>>>>   print(Encoding(tmp))
>>>>   print(charToRaw(tmp))
>>>>   tmpfilepath <- tempfile()
>>>>   writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)
>>>>
>>>> [1] "UTF-8"
>>>> [1] c3 a9
>>>>
>>>> Raw text as hex: c3 83 c2 a9
>>>>
>>>> If I switch to useBytes = FALSE, then the variable is written correctly as c3 a9.
>>>>
>>>> Any thoughts? This behavior is related to this issue: 
>>>> https://github.com/yihui/knitr/issues/1509
>>>>
>>>>
>>>> ______________________________________________
>>>> R-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel

