> On Jul 7, 2016, at 11:40 AM, Hadley Wickham <h.wick...@gmail.com> wrote:
>
> On Thu, Jul 7, 2016 at 10:11 AM, Duncan Murdoch
> <murdoch.dun...@gmail.com> wrote:
>> On 07/07/2016 10:57 AM, Hadley Wickham wrote:
>>>
>>> If you print:
>>>
>>> "\xc9\x82\xbf"
>>>
>>> you get
>>>
>>> "\u0242\xbf"
>>>
>>> But if you try and evaluate that string you get:
>>>
>>>> "\u0242\xbf"
>>>
>>> Error: mixing Unicode and octal/hex escapes in a string is not allowed
>>>
>>> (Probably will only happen on mac/linux with default utf-8 encoding)
>>
>>
>> I'm not sure what should happen here, but that's not a legal string in a
>> UTF-8 locale, so it's not too surprising that things go wonky.
>
> Here's bit more context on how I got that sequence of bytes:
>
> x <- "こんにちは"
> y <- iconv(x, to = "Shift-JIS")
> Encoding(y)
> y
>
> I did this to create an example to demonstrate how to handle encoding
> problems, and it's bit frustrating that I have to manually mangle the
> string in order to be able to re-use it in another session. Maybe
> strings with unknown encoding shouldn't use unicode escapes?
>

The real issue is that the only supported encodings of strings in R are
native (= current locale), latin1, and UTF-8. So unless you're running in
a Shift-JIS locale, that encoding is not supported in your R, and the
result of the iconv() above is not a valid R string, just a sequence of
bytes that R doesn't know how to deal with. It tries to interpret it in
your locale (UTF-8) just as a guess, but that doesn't quite work. To
illustrate, doing this in the C locale yields a different result:

> x
[1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
> y <- iconv(x, from="UTF-8", to = "Shift-JIS")
> y
[1] "\202\261\202\361\202\311\202\277\202\315"

If you want a result that does not depend on your locale and is in none
of the supported encodings, you have to declare it as bytes (back in
UTF-8):

> Encoding(y)="bytes"
> y
[1] "\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd"
> iconv(y, from="Shift-JIS", to="utf-8")
[1] "こんにちは"

But that has its own perils, such as the fact that you cannot dput()
byte-encoded strings.

Cheers,
Simon

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
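
As a rough sketch of one way to carry such bytes between sessions, the
converted text can be kept as a raw vector instead of a character string.
This assumes the local iconv build recognises the name "Shift-JIS" (as in
the example in this thread); the variable names `sjis` and `back` are
illustrative only and do not come from the thread.

```r
# "こんにちは" written with \u escapes so the source does not depend on the locale
x <- "\u3053\u3093\u306b\u3061\u306f"

# toRaw = TRUE makes iconv() return a list with one raw vector per element,
# so no encoding guess is applied to the Shift-JIS bytes
sjis <- iconv(x, from = "UTF-8", to = "Shift-JIS", toRaw = TRUE)

# A raw vector can be written out with dput() and pasted into another session
dput(sjis[[1]])

# iconv() also accepts a list of raw vectors as its input
back <- iconv(sjis, from = "Shift-JIS", to = "UTF-8")
identical(back, x)
```

Keeping the data as raw avoids marking a character string with an
encoding that R does not support; it only becomes a string again once it
has been converted back to UTF-8.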