[Rd] Character Encoding: Why are valid Windows-1252 characters encoded as invalid ISO-8859-1 characters?

Jason Wood Wed, 20 Mar 2013 12:52:02 -0700

The following looks like a bug to me, but there may be a documented
reason that I have missed, so I wanted to ask about it here first.  Please
let me know whether I should submit this as a bug or, if not, why this
behavior is intended.


I am using RGui v2.15.3, 64-bit, on a Windows 7 machine with the US English
locale.

The following session shows the behavior I describe:
--------------------

> # My default encoding is Windows code page 1252.
> Sys.getlocale("LC_CTYPE")
[1] "English_United States.1252"

> # R thinks the best character set to use is ISO8859-1, a subset of
> # windows-1252.
> localeToCharset()
[1] "ISO8859-1"

> # I create a 'right quote' character, using a byte value that is valid in
> # windows-1252 but NOT VALID in ISO8859-1.
> x <- "\x92"
> # R has chosen to mark it as 'latin1', which seems to be a synonym for
> # ISO8859-1.
> Encoding(x)
[1] "latin1"
> # Even though the character is invalid in latin1, it renders as if it were
> # the valid windows-1252 character.
> x
[1] "’"

> # Converting to UTF-8 gives us not a valid UTF-8 'right quote' (\u2019) but
> # the C1 control character 'PRIVATE USE TWO' (U+0092).
> enc2utf8(x)
[1] "\u0092"

> # Moving the UTF-8 string back to the native encoding correctly shows that
> # the 'PRIVATE USE TWO' character cannot be rendered in windows-1252.
> enc2native(enc2utf8(x))
[1] "<U+0092>"

---------------------
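
To compare the two readings of the byte side by side without depending on the
session locale, the source character set can be named explicitly in iconv().
This is a minimal sketch and not part of the session above; it assumes that
the platform's iconv accepts the names "ISO-8859-1" and "CP1252" (iconvlist()
shows what is available), and the variable names are only illustrative.

```r
# Build the single byte 0x92 directly, so nothing depends on how the
# parser marks the string literal "\x92".
x <- rawToChar(as.raw(0x92))

# Decode the same byte under each character set and convert to UTF-8.
as_latin1 <- iconv(x, from = "ISO-8859-1", to = "UTF-8")
as_cp1252 <- iconv(x, from = "CP1252", to = "UTF-8")

# Inspect the resulting Unicode code points.
cp_latin1 <- utf8ToInt(as_latin1)  # 146:  U+0092, a C1 control character
cp_cp1252 <- utf8ToInt(as_cp1252)  # 8217: U+2019, right single quotation mark
```

The same byte therefore maps to two different code points, depending only on
which character set it is declared to be in.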

I think the problem arises when R decides that the valid 1252 character
should be represented by default in a 'latin1' (ISO8859-1) encoded string
rather than in the native 'windows-1252'.
Note that if we force the encoding to stay native, everything works fine:
----------------------
> # Force the encoding to the native 1252.
> Encoding(x) <- "unknown"
> # Converting to UTF-8 now gives us the valid UTF-8 'right quote' character.
> enc2utf8(x)
[1] "’"
> # Going back to the native encoding works exactly as it should.
> enc2native(enc2utf8(x))
[1] "’"
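
An alternative that does not involve changing the encoding mark is to name the
source character set explicitly when converting. The sketch below is only an
illustration under that assumption: cp1252_to_utf8() is a hypothetical helper,
not an existing R function, and it assumes the platform's iconv knows the name
"CP1252".

```r
# Hypothetical helper: interpret the bytes of `s` as Windows-1252 and
# return the corresponding UTF-8 string.
cp1252_to_utf8 <- function(s) {
  iconv(s, from = "CP1252", to = "UTF-8")
}

# The single byte 0x92, built without relying on the parser.
x <- rawToChar(as.raw(0x92))

# Convert to UTF-8, then back to Windows-1252.
u    <- cp1252_to_utf8(x)
back <- iconv(u, from = "UTF-8", to = "CP1252")

# Check the code point and the round trip.
is_right_quote <- identical(utf8ToInt(u), 8217L)               # U+2019
round_trips    <- identical(charToRaw(back), as.raw(0x92))     # same byte
```

Naming the character set in the call keeps the result independent of the
locale and of whatever encoding the string happens to be marked with.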


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
