On Friday 20 January 2012, Simon Urbanek wrote:
> On Jan 19, 2012, at 6:39 PM, Thomas Zumbrunn wrote:
> > On Thursday 19 January 2012, peter dalgaard wrote:
> >> On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
> >>> plain("Zürich") ## works
> >>> plain("Z\u00BCrich") ## fails
> >>> escaped("Zürich") ## fails
> >>> escaped("Z\u00BCrich") ## works
> >>
> >> Using the correct UTF-8 code helps quite a bit:
> >>
> >> U+00BC ¼ c2 bc VULGAR FRACTION ONE QUARTER
> >> U+00FC ü c3 bc LATIN SMALL LETTER U WITH DIAERESIS
> >
> > Thank you for pointing that out. How embarrassing - I systematically used
> > the wrong representations. Even worse, I didn't carefully read "Writing
> > R Extensions" which speaks of "Unicode as \uxxxx escapes" rather than
> > "UTF-8 as \uxxxx escapes", so e.g. looking up the UTF-16 byte
> > representations would have done the trick.
> >
> > I didn't find a recommended method of replacing non-ASCII characters with
> > Unicode \uxxxx escape sequences and ended up using the Unix command line
> > tool "iconv". However, the iconv version installed on my GNU/Linux
> > machine (openSUSE 11.4) seems to be outdated and doesn't support the
> > very useful "--unicode-subst" option yet. I installed "libiconv" from
> > http://www.gnu.org/software/libiconv/, and now I can easily replace all
> > non-ASCII characters in my UTF-8 encoded R files with:
> >
> > iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
>
> You can actually do that with R alone:
>
> ## you'll have to make sure that you're in C locale so R does the conversion for you
> Sys.setlocale(,"C")
>
> utf8conv <- function(conn)
>   gsub("<U\\+([0-9A-F]{4})>","\\\\u\\1",capture.output(writeLines(readLines(conn,encoding="UTF-8"))))
>
> > writeLines(utf8conv("test.txt"))
>
> M\u00F6gliche L\u00F6sung
> ne nebezpe\u010Dn\u00E9
>
> Cheers,
> Simon
Thanks for the above function (which I wouldn't have managed to construct, ever...). Maybe this is worth mentioning in the "Writing R Extensions" manual (next to where the \uxxxx Unicode escape sequences are mentioned).

Thomas

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
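
For reference, the conversion discussed in this thread can also be sketched without switching to the C locale, by working on Unicode code points directly. The helper name escape_non_ascii below is hypothetical and not from the thread; it uses only base R (enc2utf8, utf8ToInt, intToUtf8, sprintf) and assumes the input is valid UTF-8. Code points above U+FFFF are written with the eight-digit \U form, since \uxxxx holds at most four hex digits.

```r
## Hypothetical helper (not from the thread): replace every non-ASCII
## character in a character vector with a \uXXXX (or \UXXXXXXXX) escape.
## Assumes the input strings are valid UTF-8.
escape_non_ascii <- function(x) {
  x <- enc2utf8(x)
  vapply(x, function(s) {
    cp <- utf8ToInt(s)                       # Unicode code points, not UTF-8 bytes
    chars <- intToUtf8(cp, multiple = TRUE)  # one string per code point
    bmp  <- cp > 127L & cp <= 65535L         # non-ASCII, fits in four hex digits
    high <- cp > 65535L                      # needs the eight-digit \U form
    chars[bmp]  <- sprintf("\\u%04X", cp[bmp])
    chars[high] <- sprintf("\\U%08X", cp[high])
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

## The escape takes the code point (U+00FC), not the UTF-8 bytes (c3 bc):
utf8ToInt("\u00FC")                          # 252, i.e. 0xFC
charToRaw("\u00FC")                          # c3 bc
cat(escape_non_ascii("Z\u00FCrich"), "\n")   # Z\u00FCrich

## Applied to a file, assuming it is UTF-8 encoded:
## writeLines(escape_non_ascii(readLines("my-utf-8-encoded-file.R", encoding = "UTF-8")))
```

Like the iconv call quoted above, this escapes every non-ASCII character in the input, so the result is only valid R code where those characters occur inside string literals or comments.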