It's known that R for Windows doesn't cope with UTF-8 very well. I've
raised this issue here:
http://stackoverflow.com/questions/19877676/write-utf-8-files-from-r

For this reason, I use R on a UNIX system whenever I need characters
outside the Latin-1 range.
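
For anyone who wants to check whether such a character survives a trip through R, here is a minimal base-R sketch. It is not a fix for the Windows display problem, and I have only reasoned about it for a UTF-8 system. The temporary file and the variable names are placeholders of mine, not anything from the original data.

```r
# U+29C2C (the character from the question), written as an escape so
# that the script itself stays plain ASCII.
x <- "\U00029C2C"

# Code point and UTF-8 byte length, base R only.
cp <- utf8ToInt(x)                    # integer code point
n_bytes <- nchar(x, type = "bytes")   # number of bytes in the UTF-8 encoding
print(sprintf("U+%X", cp))            # "U+29C2C"
print(n_bytes)                        # 4

# Write the string as UTF-8 bytes; useBytes = TRUE asks writeLines()
# not to re-encode it into the native encoding on the way out.
path <- tempfile(fileext = ".txt")    # placeholder file name
writeLines(enc2utf8(x), path, useBytes = TRUE)

# Read it back, declaring the input to be UTF-8, and compare code points
# rather than relying on how the console displays the character.
y <- readLines(path, encoding = "UTF-8")
round_trip_ok <- identical(utf8ToInt(y), cp)
print(round_trip_ok)
```

If round_trip_ok comes out FALSE on a given Windows setup, that would suggest the data itself is being altered and not only its display.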

Sverre

On Wed, Apr 22, 2015 at 6:37 AM, Dan McCloy <drmcc...@uw.edu> wrote:
> You can tell for sure if it's a 4-byte character using stringi:
>
> stringi::stri_trans_general("𩰬", "Any-Hex/Unicode")
> # result is U+29C2C
>
> Since it is within the range of U+10000 to U+10FFFF, it is indeed a 4-byte
> character.  Regarding the proper display of such characters in R: all I can
> say is that it works on Linux in RStudio and in R in the terminal.  I don't
> know how to get it to work on Windows, and all my experience with Windows
> suggests that the best you can hope for is for the characters to display OK
> in the RStudio script window, and _maybe_ on the plots, but definitely not
> in the console window.  Regarding whether the display issue affects R's
> ability to do stats: I'm pretty sure the answer is that the stats will come
> out fine.  If you're really nervous about it, do a
> stringi::stri_replace_all_fixed() for each of the 4-byte characters to give
> them aliases, re-run the stats, and see if they're different.
> -- dan
>
> Daniel McCloy
> http://dan.mccloy.info/
> Postdoctoral Research Fellow
> Institute for Learning and Brain Sciences
> University of Washington
>
>
>
> On Wed, Apr 22, 2015 at 11:27 AM, Lngmyers <lngmy...@ccu.edu.tw> wrote:
>>
>>
>> Sorry if this is covered somewhere, but I've been searching for a solution
>> in vain. I'm trying to do a Chinese text analysis involving what I think are
>> 4-byte Unicode characters, like this: "𩰬". In case that turns into garbage
>> for you, here's a 3-byte Unicode character for comparison: "鬲". (My guess
>> that the "bad" characters like "𩰬" are 4-byte comes from
>> http://en.wikipedia.org/wiki/UTF-8, though I don't know how to test this
>> hypothesis.)
>>
>> The original text file is encoded in UTF-8, and I'm working in traditional
>> Chinese Windows 7. It doesn't matter if I turn on "Message translations"
>> when installing R (i.e. use Chinese for the GUI).
>>
>> So here's the problem.
>>
>> When I load the text file using readLines("file.txt", encoding="UTF-8"), R
>> turns "𩰬" into "\xf0©°¬". (If I don't set any encoding, none of the Chinese
>> characters make it.) Even if I copy/paste this character into the R
>> terminal, it turns into "??" (normal characters work fine). Setting the
>> encoding to "UTF-16" makes things worse. I looked at specialized text
>> packages like tau or stringi but they don't seem to help (though maybe I
>> gave up too soon).
>>
>> I suppose I can work around this problem - maybe it doesn't matter that R
>> can't display the characters if it can still distinguish them (allowing me
>> to count stuff etc). There aren't a huge number of these characters anyway -
>> fewer than 300, out of a database of over 21,000 characters. But I'm worried
>> that being unable to represent them properly may be messing with my stats in
>> unexpected ways.
>>
>> Not only that, the 4-byte Unicode limitation seems to imply that R (in
>> Windows) can't handle emojis!
>>
>> I hope you're not going to say "Use Linux"....
>>
>> --
>> James Myers
>> Graduate Institute of Linguistics
>> National Chung Cheng University
>> 168 University Road, Min-Hsiung
>> Chia-Yi 62102
>> TAIWAN
>> Email:  lngmy...@ccu.edu.tw
>> Web:    http://www.ccunix.ccu.edu.tw/~lngmyers/
>> Phone:  886-5-272-9251
>> Fax:    886-5-272-1654
>
>
