Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments

Simon Urbanek Thu, 19 Jan 2012 18:00:18 -0800

On Jan 19, 2012, at 6:39 PM, Thomas Zumbrunn wrote:

> On Thursday 19 January 2012, peter dalgaard wrote:
>> On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
>>>  plain("Zürich")  ## works
>>>  plain("Z\u00BCrich")  ## fails
>>>  escaped("Zürich")  ## fails
>>>  escaped("Z\u00BCrich")  ## works
>> 
>> Using the correct UTF-8 code helps quite a bit:
>> 
>> U+00BC       ¼       c2 bc   VULGAR FRACTION ONE QUARTER
>> U+00FC       ü       c3 bc   LATIN SMALL LETTER U WITH DIAERESIS
> 
> Thank you for pointing that out. How embarrassing - I systematically used the 
> wrong representations. Even worse, I didn't carefully read "Writing R 
> Extensions" which speaks of "Unicode as \uxxxx escapes" rather than "UTF-8 as 
> \uxxxx escapes", so e.g. looking up the UTF-16 byte representations would 
> have done the trick.
> 
> I didn't find a recommended method of replacing non-ASCII characters with 
> Unicode \uxxxx escape sequences and ended up using the Unix command line tool 
> "iconv". However, the iconv version installed on my GNU/Linux machine 
> (openSUSE 11.4) seems to be outdated and doesn't support the very useful 
> "--unicode-subst" option yet. I installed "libiconv" from 
> http://www.gnu.org/software/libiconv/, and now I can easily replace all 
> non-ASCII characters in my UTF-8 encoded R files with:
> 
>  iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
>



You can actually do that with R alone:

## you'll have to make sure that you're in C locale so R does the conversion
## for you
Sys.setlocale(,"C")

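utf8conv <- function(conn) {
  ## in the C locale, non-ASCII characters are written out as <U+XXXX>
  lines <- capture.output(writeLines(readLines(conn, encoding="UTF-8")))
  ## rewrite each <U+XXXX> as a \uXXXX escape
  gsub("<U\\+([0-9A-F]{4})>", "\\\\u\\1", lines)
}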

> writeLines(utf8conv("test.txt"))
M\u00F6gliche L\u00F6sung
ne nebezpe\u010Dn\u00E9
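
If switching the session to the C locale is not desirable, the same replacement can probably be done by working on the code points directly. The following is only a sketch, not something from this thread: escape_non_ascii is a hypothetical helper name, it assumes the input is valid UTF-8 without NA values, and it writes characters beyond U+FFFF as \UXXXXXXXX.

```r
## Hypothetical helper (not from the thread): escape every non-ASCII
## character as \uXXXX, independent of the current locale.
escape_non_ascii <- function(x) {
  vapply(enc2utf8(x), function(s) {
    cp <- utf8ToInt(s)                        # Unicode code points of s
    ch <- intToUtf8(cp, multiple = TRUE)      # one string per character
    out <- ifelse(cp < 128L, ch,              # ASCII stays as it is
           ifelse(cp <= 0xFFFF,
                  sprintf("\\u%04X", cp),     # BMP: four hex digits
                  sprintf("\\U%08X", cp)))    # beyond BMP: eight hex digits
    paste(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

writeLines(escape_non_ascii("Z\u00FCrich"))
```

For a whole file, something like writeLines(escape_non_ascii(readLines("test.txt", encoding="UTF-8"))) should give the same kind of output as above.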

Cheers,
Simon

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
