On Thu, 3 Aug 2006, Thomas Kuster wrote:

Hello

On Wednesday, 2 August 2006 17:11, Thomas Lumley wrote:
This sounds like a conflict between encodings -- e.g. if R is assuming UTF-8
and the file is encoded in Latin-1 then the sequence
U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
U+0072 : LATIN SMALL LETTER R
is coded as FC72 in the file, which is an illegal byte sequence in UTF-8.

Hex:  74 65 20 66 fc 72 20 61 6c 6c 65 53 45 2f 31 36
Text:  t  e     f  ü  r     a  l  l  e  S  E  /  1  6

Ok, so that looks like Latin-1 encoding in the file.
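
As a rough illustration of the mismatch (a sketch only, assuming R 3.3.0 or later for validUTF8(), and using the three bytes 66 fc 72 from the dump rather than the real file), one can check that the bytes are not valid UTF-8 and then re-encode them while declaring the source as Latin-1:

```r
# Bytes 66 fc 72 spell "für" when read as Latin-1 (taken from the hex dump).
x <- rawToChar(as.raw(c(0x66, 0xfc, 0x72)))

# FALSE: 0xfc followed by 0x72 is not a valid UTF-8 sequence.
is_utf8 <- validUTF8(x)

# Re-encode, telling iconv() that the input is Latin-1.
y <- iconv(x, from = "latin1", to = "UTF-8")

# After conversion the u-umlaut takes two bytes: 66 c3 bc 72.
y_bytes <- charToRaw(y)
```

The same iconv() call could be applied to strings read from the file, if they do turn out to be Latin-1.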

The underlying C code (being written in the US quite a long time ago)
doesn't know about encodings, and I don't know what the rules are in SPSS
for valid characters. I suspect that in these old portable file formats it
probably just reads and writes bytes, leaving it up to the OS to interpret
them.

But why does the C code stop reading? Is "/" not the end mark of the string?
What is the problem if I change that in the source?

You haven't shown anything that indicates that the C code stopped reading. More likely R just stops displaying when it gets to an illegal byte sequence. You could use nchar() to count the bytes in the string to find out.
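
A minimal sketch of that check, using the first 11 bytes of the dump as a stand-in for the string actually read from the file (the variable names are only for illustration). Asking for type = "bytes" explicitly avoids any dependence on how the locale interprets the string:

```r
# First 11 bytes of the hex dump: "te f?r alle" with 0xfc in the middle.
x <- rawToChar(as.raw(c(0x74, 0x65, 0x20, 0x66, 0xfc, 0x72,
                        0x20, 0x61, 0x6c, 0x6c, 0x65)))

# Byte count does not depend on the encoding being valid.
n_bytes <- nchar(x, type = "bytes")

# Character count; allowNA = TRUE returns NA instead of an error
# if the session treats x as an invalid multibyte string.
n_chars <- nchar(x, type = "chars", allowNA = TRUE)
```

If the byte count covers the whole string, the C code read everything and only the display is being cut short.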

You could try running R in a non-UTF-8 locale to see if it helps.

I think my locale is non-UTF-8 (de_CH, isolatin). How can I check that, and
set another one temporarily?

You can use charToRaw() to see what R thinks the byte sequence is for a word with a u-umlaut.

Sys.setlocale() will let you change the locale, but your locale does look non-UTF-8.
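
Roughly, and only as a sketch (the choice of the "C" locale for the temporary switch is just an example, since locale names differ between systems):

```r
# Current character-type locale, e.g. "de_CH.ISO-8859-1" on some systems.
ctype <- Sys.getlocale("LC_CTYPE")

# TRUE if the session is running in a UTF-8 locale.
is_utf8_session <- isTRUE(l10n_info()[["UTF-8"]])

# Bytes of u-umlaut in UTF-8 (c3 bc) and in Latin-1 (fc).
u_utf8   <- charToRaw(enc2utf8("\u00fc"))
u_latin1 <- charToRaw(iconv(enc2utf8("\u00fc"), from = "UTF-8", to = "latin1"))

# Switch temporarily, then restore the original setting.
Sys.setlocale("LC_CTYPE", "C")
# ... read the file or inspect the strings here ...
Sys.setlocale("LC_CTYPE", ctype)
```

Typing a word with a u-umlaut at the console and passing it to charToRaw() shows which of the two byte patterns your session actually uses.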

This is all guesswork since we can't see the file.

        -thomas