This looks like a bug to me, but since there may be a documented reason that I have missed, I wanted to ask about it here first. Please let me know whether I should submit this as a bug or, if not, why the behavior is intended.
I am using RGui v2.15.3, 64-bit, on a Windows 7 machine with a US English locale. The following session shows the behavior I describe:

--------------------
> Sys.getlocale("LC_CTYPE") # my default encoding is Windows code page 1252
[1] "English_United States.1252"
> localeToCharset() # R thinks the best character set to use is ISO8859-1, a subset of windows-1252
[1] "ISO8859-1"
> x<-"\x92" # I create a 'right quote' character, using a value valid in windows-1252 but NOT VALID in ISO8859-1
> Encoding(x) # R has chosen to mark it as 'latin1', which seems to be a synonym for ISO8859-1
[1] "latin1"
> x # Even though the character is invalid in latin1, it renders as if it were the valid windows-1252 character
[1] "’"
> enc2utf8(x) # Converting to UTF-8 gives us not a valid UTF-8 'right quote' (\u2019) but the Unicode control character U+0092 ('PRIVATE USE TWO')
[1] "\u0092"
> enc2native(enc2utf8(x)) # Converting the UTF-8 back to the native encoding correctly shows that 'PRIVATE USE TWO' cannot be rendered in windows-1252
[1] "<U+0092>"
---------------------

I think the problem occurs when R decides that the valid 1252 character should by default be represented in a 'latin1' (ISO8859-1) encoded string rather than in the native 'windows-1252'.

Note that if we force the encoding to stay native, everything works fine:

----------------------
> Encoding(x)<-"unknown" # Force the encoding to the native 1252
> enc2utf8(x) # Converting to UTF-8 now gives us the valid UTF-8 'right quote' character
[1] "’"
> enc2native(enc2utf8(x)) # and going back to the native encoding works exactly as it should
[1] "’"
----------------------
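
For comparison, here is a minimal sketch of a conversion that does not depend on how the string is marked. It assumes that the platform's iconv recognizes the encoding name "CP1252", which I have not confirmed for every build; the variable name y is only illustrative.

```r
# Sketch: convert the byte 0x92 explicitly from windows-1252 to UTF-8,
# instead of relying on the 'latin1' mark that R attaches to the string.
# Assumption: this platform's iconv accepts the name "CP1252".
x <- "\x92"

# iconv() interprets the bytes of x according to 'from', whatever its mark.
y <- iconv(x, from = "CP1252", to = "UTF-8")

# The result should be the single code point U+2019 (8217 in decimal),
# i.e. RIGHT SINGLE QUOTATION MARK.
print(Encoding(y))
print(utf8ToInt(y))
```

If this gives U+2019 while enc2utf8(x) on the 'latin1'-marked string gives U+0092, that would support the idea that the mark, not the bytes, is what determines the result.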
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel