On Jan 19, 2012, at 6:39 PM, Thomas Zumbrunn wrote:

> On Thursday 19 January 2012, peter dalgaard wrote:
>> On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
>>> plain("Zürich") ## works
>>> plain("Z\u00BCrich") ## fails
>>> escaped("Zürich") ## fails
>>> escaped("Z\u00BCrich") ## works
>>
>> Using the correct UTF-8 code helps quite a bit:
>>
>> U+00BC ¼ c2 bc VULGAR FRACTION ONE QUARTER
>> U+00FC ü c3 bc LATIN SMALL LETTER U WITH DIAERESIS
>
> Thank you for pointing that out. How embarrassing - I systematically used the
> wrong representations. Even worse, I didn't carefully read "Writing R
> Extensions" which speaks of "Unicode as \uxxxx escapes" rather than "UTF-8 as
> \uxxxx escapes", so e.g. looking up the UTF-16 byte representations would have
> done the trick.
>
> I didn't find a recommended method of replacing non-ASCII characters with
> Unicode \uxxxx escape sequences and ended up using the Unix command line tool
> "iconv". However, the iconv version installed on my GNU/Linux machine
> (openSUSE 11.4) seems to be outdated and doesn't support the very useful
> "--unicode-subst" option yet. I installed "libiconv" from
> http://www.gnu.org/software/libiconv/, and now I can easily replace all
> non-ASCII characters in my UTF-8 encoded R files with:
>
> iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
>
You can actually do that with R alone:

## you'll have to make sure that you're in C locale so R does the conversion for you
Sys.setlocale(,"C")

utf8conv <- function(conn)
  gsub("<U\\+([0-9A-F]{4})>", "\\\\u\\1",
       capture.output(writeLines(readLines(conn, encoding="UTF-8"))))

> writeLines(utf8conv("test.txt"))
M\u00F6gliche L\u00F6sung ne nebezpe\u010Dn\u00E9

Cheers,
Simon

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
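
The approach above relies on R printing non-ASCII characters as <U+xxxx> in the C locale and then rewriting that notation with gsub(). As a hedged alternative, the same escaping can be sketched without touching the locale by working on code points directly. The helper name escape_non_ascii is hypothetical and not part of the original post; it uses only base R (enc2utf8, utf8ToInt, intToUtf8, sprintf) and assumes the input is valid text without NA entries. It also emits \UXXXXXXXX for code points above U+FFFF, which a four-digit pattern would not match.

```r
# Hypothetical helper (not from the original thread): replace every
# non-ASCII character in a character vector with an R Unicode escape.
# Assumes valid, non-NA input strings.
escape_non_ascii <- function(x) {
  one <- function(s) {
    # Unicode code points of the string (not its UTF-8 bytes)
    cp <- utf8ToInt(enc2utf8(s))
    out <- ifelse(cp < 128L,
                  intToUtf8(cp, multiple = TRUE),      # keep ASCII as-is
                  ifelse(cp <= 0xFFFF,
                         sprintf("\\u%04X", cp),       # BMP: \uXXXX
                         sprintf("\\U%08X", cp)))      # beyond BMP: \UXXXXXXXX
    paste(out, collapse = "")
  }
  vapply(x, one, character(1), USE.NAMES = FALSE)
}

# Code point versus UTF-8 bytes for "ü", the distinction discussed above:
utf8ToInt("\u00FC")   # 252, i.e. 0xFC, the value that goes into the escape
charToRaw("\u00FC")   # c3 bc, the UTF-8 byte sequence

writeLines(escape_non_ascii("Z\u00FCrich"))   # prints Z\u00FCrich

# For a whole file, something along these lines should work:
# writeLines(escape_non_ascii(readLines("test.txt", encoding = "UTF-8")))
```

Working from code points keeps the session locale unchanged, at the cost of a slightly longer function than the gsub() one-liner.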