It's known that R for Windows doesn't cope with UTF-8 very well. I've raised this issue here: http://stackoverflow.com/questions/19877676/write-utf-8-files-from-r
For this reason, I use R on a UNIX system whenever I need characters outside of the Latin-1 encoding.

Sverre

On Wed, Apr 22, 2015 at 6:37 AM, Dan McCloy <drmcc...@uw.edu> wrote:
> You can tell for sure if it's a 4-byte character using stringi:
>
> stringi::stri_trans_general("𩰬", "Any-Hex/Unicode")
> # result is U+29C2C
>
> Since it is within the range of U+10000 to U+10FFFF, it is indeed a 4-byte
> character. Regarding the proper display of such characters in R: all I can
> say is that it works on Linux in RStudio and in R in the terminal. I don't
> know how to get it to work on Windows, and all my experience with Windows
> suggests that the best you can hope for is for the characters to display OK
> in the RStudio script window, and _maybe_ on the plots, but definitely not
> in the console window. Regarding whether the display issue affects R's
> ability to do stats: I'm pretty sure the answer is that the stats will come
> out fine. If you're really nervous about it, do a
> stringi::stri_replace_all_fixed() for each of the 4-byte characters to give
> them aliases, re-run the stats, and see if they're different.
> -- dan
>
> Daniel McCloy
> http://dan.mccloy.info/
> Postdoctoral Research Fellow
> Institute for Learning and Brain Sciences
> University of Washington
>
> On Wed, Apr 22, 2015 at 11:27 AM, Lngmyers <lngmy...@ccu.edu.tw> wrote:
>>
>> Sorry if this is covered somewhere, but I've been searching for a solution
>> in vain. I'm trying to do a Chinese text analysis involving what I think are
>> 4-byte Unicode characters, like this: "𩰬". In case that turns into garbage
>> for you, here's a 3-byte Unicode character for comparison: "鬲". (My guess
>> that the "bad" characters like "𩰬" are 4-byte comes from
>> http://en.wikipedia.org/wiki/UTF-8, though I don't know how to test this
>> hypothesis.)
>>
>> The original text file is encoded in UTF-8, and I'm working in traditional
>> Chinese Windows 7.
>> It doesn't matter if I turn on "Message translations"
>> when installing R (i.e. use Chinese for the GUI).
>>
>> So here's the problem.
>>
>> When I load the text file using readLines("file.txt", encoding="UTF-8"), R
>> turns "𩰬" into "\xf0©°¬". (If I don't set any encoding, none of the Chinese
>> characters make it.) Even if I copy/paste this character into the R
>> terminal, it turns into "??" (normal characters work fine). Setting the
>> encoding to "UTF-16" makes things worse. I looked at specialized text
>> packages like tau or stringi but they don't seem to help (though maybe I
>> gave up too soon).
>>
>> I suppose I can work around this problem - maybe it doesn't matter that R
>> can't display the characters if it can still distinguish them (allowing me
>> to count stuff etc). There aren't a huge number of these characters anyway -
>> fewer than 300, out of a database of over 21,000 characters. But I'm worried
>> that being unable to represent them properly may be messing with my stats in
>> unexpected ways.
>>
>> Not only that, the 4-byte Unicode limitation seems to imply that R (in
>> Windows) can't handle emojis!
>>
>> I hope you're not going to say "Use Linux"....
>>
>> --
>> James Myers
>> Graduate Institute of Linguistics
>> National Chung Cheng University
>> 168 University Road, Min-Hsiung
>> Chia-Yi 62102
>> TAIWAN
>> Email: lngmy...@ccu.edu.tw
>> Web: http://www.ccunix.ccu.edu.tw/~lngmyers/
>> Phone: 886-5-272-9251
>> Fax: 886-5-272-1654
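
For anyone who wants to test the 4-byte hypothesis from the thread without installing stringi, here is a minimal base-R sketch. It is not anyone's posted solution. It assumes a reasonably recent R build that accepts `\U` escapes above U+FFFF, and it assumes the comparison character "鬲" is U+9B32, which is my own lookup and is not stated in the thread. The variable names are illustrative only.

```r
# Sketch: check how many UTF-8 bytes each character needs, using base R only.
# The characters are written as escapes so the script itself stays ASCII.
x <- c("\U00029C2C",  # the character Dan's message reports as U+29C2C
       "\u9b32")      # ASSUMED code point for the 3-byte comparison character

# Number of bytes in each string as stored (UTF-8 for these literals).
n_bytes <- nchar(x, type = "bytes")

# Code point of each single-character string, as an integer.
code_points <- vapply(x, utf8ToInt, integer(1), USE.NAMES = FALSE)

# Code points above U+FFFF are the ones that need 4 bytes in UTF-8.
is_four_byte <- code_points > 0xFFFF

print(n_bytes)
print(sprintf("U+%04X", code_points))
print(is_four_byte)
```

The same comparison could be applied to every distinct character in a corpus to list the ones that need 4 bytes. Whether they then display correctly in the Windows console is a separate question, as discussed above.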