jm et al

I've been using R for a phonology class this term, and one of the students using 
Windows had a similar problem. The file he was trying to read was encoded in 
UTF-16LE, but in some ambiguous fashion. None of the other students using 
Windows or Linux could get it to read properly. In the end, I read it in on my 
Mac with that encoding and wrote it out again as UTF-8; he then read it in with 
no encoding set and converted it to UTF-8 once he'd read it in.
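
For anyone wanting to try the same round trip, here is a minimal sketch in R. 
It is not the exact code we ran: the file paths are hypothetical temp files, 
the sample file is built on the spot so the sketch is self-contained, and it 
assumes the platform's iconv knows the name "UTF-16LE".

```r
# Build a small UTF-16LE sample file so the sketch runs on its own.
# Both paths are hypothetical stand-ins for the real files.
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".txt")
sample_text <- "\u9b32\n"  # U+9B32, the 3-byte example from the question
writeBin(iconv(sample_text, from = "UTF-8", to = "UTF-16LE",
               toRaw = TRUE)[[1]], src)

# Step 1: read the file through a connection that declares its encoding.
con <- file(src, open = "r", encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

# Step 2: write the text back out as UTF-8 bytes.
# useBytes = TRUE stops writeLines() from re-encoding to the native locale.
writeLines(enc2utf8(lines), dst, useBytes = TRUE)

# Step 3: read the UTF-8 copy, marking the strings as UTF-8.
utf8_lines <- readLines(dst, encoding = "UTF-8")
```

One caveat, as far as I understand it: step 1 converts the text to the 
session's native encoding, so in a non-UTF-8 locale any character the locale 
cannot represent may not survive the trip.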

I'd love to say the moral is Macs are king, but I think the actual moral is 
encodings are voodoo.

mike h.

> On Apr 21, 2015, at 8:27 PM, Lngmyers <lngmy...@ccu.edu.tw> wrote:
> 
> 
> Sorry if this is covered somewhere, but I've been searching for a solution in 
> vain. I'm trying to do a Chinese text analysis involving what I think are 
> 4-byte Unicode characters, like this: "𩰬". In case that turns into garbage 
> for you, here's a 3-byte Unicode character for comparison: "鬲". (My guess 
> that the "bad" characters like "𩰬" are 4-byte comes from 
> http://en.wikipedia.org/wiki/UTF-8, though I don't know how to test this 
> hypothesis.)
> 
> The original text file is encoded in UTF-8, and I'm working in traditional 
> Chinese Windows 7. It doesn't matter if I turn on "Message translations" when 
> installing R (i.e. use Chinese for the GUI).
> 
> So here's the problem.
> 
> When I load the text file using readLines("file.txt", encoding="UTF-8"), R 
> turns "𩰬" into "\xf0©°¬". (If I don't set any encoding, none of the Chinese 
> characters make it.) Even if I copy/paste this character into the R terminal, 
> it turns into "??" (normal characters work fine). Setting the encoding to 
> "UTF-16" makes things worse. I looked at specialized text packages like tau 
> or stringi but they don't seem to help (though maybe I gave up too soon).
> 
> I suppose I can work around this problem - maybe it doesn't matter that R 
> can't display the characters if it can still distinguish them (allowing me to 
> count stuff etc). There aren't a huge number of these characters anyway - 
> fewer than 300, out of a database of over 21,000 characters. But I'm worried 
> that being unable to represent them properly may be messing with my stats in 
> unexpected ways.
> 
> Not only that, the 4-byte Unicode limitation seems to imply that R (in 
> Windows) can't handle emojis!
> 
> I hope you're not going to say "Use Linux"....
> 
> -- 
> James Myers
> Graduate Institute of Linguistics
> National Chung Cheng University
> 168 University Road, Min-Hsiung
> Chia-Yi 62102
> TAIWAN
> Email:  lngmy...@ccu.edu.tw
> Web:    http://www.ccunix.ccu.edu.tw/~lngmyers/
> Phone:  886-5-272-9251
> Fax:    886-5-272-1654

