jm et al,

I've been using R for a phonology class this term, and one of the students using Windows had a similar problem. The file he was trying to read was encoded in UTF-16LE, but in some ambiguous fashion. None of the other students using Windows or Linux could get it to read properly. In the end, I read it in on my Mac with that encoding and wrote it out again as UTF-8; he then read it in with no encoding set and converted it to UTF-8 once he'd read it in.
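For anyone wanting to try the same kind of round trip, here is a minimal sketch in base R. It is not the exact set of commands used in class: the file names are hypothetical placeholders, the sample text is generated inside the script so that it runs on its own, and it assumes the input has no byte-order mark. It converts the raw bytes with iconv() rather than relying on the session locale.

```r
# Sketch: re-encode a UTF-16LE text file as UTF-8 without relying on the locale.
# Both file names are hypothetical placeholders.
src <- file.path(tempdir(), "wordlist-utf16le.txt")
dst <- file.path(tempdir(), "wordlist-utf8.txt")

# Build a small UTF-16LE sample file so the example is self-contained.
# "\u9b32" is a 3-byte character in UTF-8, "\U00029c2c" a 4-byte one.
sample_text <- "\u9b32 and \U00029c2c\n"
writeBin(iconv(sample_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], src)

# Read the raw bytes and convert them explicitly from UTF-16LE to UTF-8.
raw_bytes <- readBin(src, what = "raw", n = file.size(src))
utf8_text <- iconv(list(raw_bytes), from = "UTF-16LE", to = "UTF-8")

# Write the UTF-8 bytes unchanged; useBytes = TRUE avoids re-encoding on output.
con <- file(dst, open = "wb")
writeLines(utf8_text, con, sep = "", useBytes = TRUE)
close(con)

# Read the result back, declaring that the strings are UTF-8.
lines <- readLines(dst, encoding = "UTF-8")
```

If the real file starts with a byte-order mark, the converted text will begin with U+FEFF, which would need to be stripped separately.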
I'd love to say the moral is Macs are king, but I think the actual moral is encodings are voodoo.

mike h.

> On Apr 21, 2015, at 8:27 PM, Lngmyers <lngmy...@ccu.edu.tw> wrote:
>
>
> Sorry if this is covered somewhere, but I've been searching for a solution in
> vain. I'm trying to do a Chinese text analysis involving what I think are
> 4-byte Unicode characters, like this: "𩰬". In case that turns into garbage
> for you, here's a 3-byte Unicode character for comparison: "鬲". (My guess
> that the "bad" characters like "𩰬" are 4-byte comes from
> http://en.wikipedia.org/wiki/UTF-8, though I don't know how to test this
> hypothesis.)
>
> The original text file is encoded in UTF-8, and I'm working in traditional
> Chinese Windows 7. It doesn't matter if I turn on "Message translations" when
> installing R (i.e. use Chinese for the GUI).
>
> So here's the problem.
>
> When I load the text file using readLines("file.txt", encoding="UTF-8"), R
> turns "𩰬" into "\xf0©°¬". (If I don't set any encoding, none of the Chinese
> characters make it.) Even if I copy/paste this character into the R terminal,
> it turns into "??" (normal characters work fine). Setting the encoding to
> "UTF-16" makes things worse. I looked at specialized text packages like tau
> or stringi but they don't seem to help (though maybe I gave up too soon).
>
> I suppose I can work around this problem - maybe it doesn't matter that R
> can't display the characters if it can still distinguish them (allowing me to
> count stuff etc). There aren't a huge number of these characters anyway -
> fewer than 300, out of a database of over 21,000 characters. But I'm worried
> that being unable to represent them properly may be messing with my stats in
> unexpected ways.
>
> Not only that, the 4-byte Unicode limitation seems to imply that R (in
> Windows) can't handle emojis!
>
> I hope you're not going to say "Use Linux"....
>
> --
> James Myers
> Graduate Institute of Linguistics
> National Chung Cheng University
> 168 University Road, Min-Hsiung
> Chia-Yi 62102
> TAIWAN
> Email: lngmy...@ccu.edu.tw
> Web: http://www.ccunix.ccu.edu.tw/~lngmyers/
> Phone: 886-5-272-9251
> Fax: 886-5-272-1654
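
On the question in the quoted message of how to test the 4-byte hypothesis, one possible check in base R is sketched below. The two characters are written as Unicode escapes so the script does not depend on how the source file or terminal is encoded; the variable names are only illustrative. It counts the UTF-8 bytes of each character and checks whether its code point lies above U+FFFF, which is the range that needs four bytes in UTF-8.

```r
# Sketch: check how many UTF-8 bytes each character occupies.
# The two characters from the message, written as escapes.
x <- enc2utf8(c("\u9b32", "\U00029c2c"))

# Number of bytes each string takes in UTF-8.
n_bytes <- nchar(x, type = "bytes")

# The raw bytes themselves, for comparison with what R prints as "\xf0...".
raw_bytes <- lapply(x, charToRaw)

# Unicode code points; anything above 0xFFFF lies outside the Basic
# Multilingual Plane and needs four bytes in UTF-8.
code_points <- vapply(x, utf8ToInt, integer(1), USE.NAMES = FALSE)
is_supplementary <- code_points > 0xFFFF

print(data.frame(code_point = sprintf("U+%04X", code_points),
                 n_bytes = n_bytes,
                 is_supplementary = is_supplementary))
```

This only shows whether the characters remain distinct at the byte level once they are in a UTF-8 string; it says nothing about whether the Windows console can display them.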