Seymour J Metz wrote, in part:
>You seem to be confirming what I wrote; if the locale is UTF-8 then
>your character data should be UTF-8. The ¬ character in UTF-8 has a
>different encoding from the ¬ character in Unicode, so there is no
>issue of a zero octet. '00AC'X is not a valid UTF-8 string.

There is no encoding in Unicode. That's the point, and it's why you can't say "AC" and expect it to be meaningful. Folks might guess, but (especially with endianness) might get it wrong.

'00ac'x is indeed invalid as a UTF-8 encoded value, but it's not the 00 that's bad (that's fine, it's a null): it's the AC, which is invalid because the top two bits are 10, and that means it's a continuation byte. So a UTF-8 parser chugs along, sees the 00 and says "OK, good, that's a single-byte encoding of a null." But then it looks at the AC and says "Hey, this is supposed to be a continuation, and I'm not IN a multi-byte encoded character, that's no good."

(One of the cool things about UTF-8 is that, assuming proper UTF-8, you can start in the middle of a string and, if you find you're at a continuation byte, you can back up to the first byte in the tuple and start from there!)

The above assumes big-endian, of course.

I'ma keep harping on "Unicode is not an encoding" because it isn't, and it matters. Former manager beat that into my head, and it took a while for me to get it, so if you've bothered to read this far and feel like it's an artificial distinction, I dig. It's not.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
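
For anyone who wants to poke at this, here is a minimal Python sketch of the walk-through above. The helper names (is_continuation, first_bad_offset, back_up_to_lead) are made up for illustration, and the check leans on Python's built-in strict UTF-8 decoder instead of a hand-rolled parser, so treat it as one way to see the behavior, not as how any particular parser is written.

```python
def is_continuation(byte: int) -> bool:
    """True when the top two bits are 10, i.e. the byte looks like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def first_bad_offset(data: bytes):
    """Offset of the first byte Python's strict UTF-8 decoder rejects, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def back_up_to_lead(data: bytes, pos: int) -> int:
    """Assuming well-formed UTF-8, step back from pos to the start of its character."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


# The octets 00 AC: U+00AC written as big-endian UTF-16, misread as UTF-8.
raw = bytes([0x00, 0xAC])
print(is_continuation(raw[0]))   # False: 0x00 is a complete one-byte character (a null)
print(is_continuation(raw[1]))   # True: 0xAC is 10101100
print(first_bad_offset(raw))     # 1: the decoder accepts the 00 and rejects the AC

# The same code point actually encoded as UTF-8 takes two different bytes.
print("\u00ac".encode("utf-8"))  # b'\xc2\xac'

# Landing on the continuation byte at offset 2 and backing up to the lead byte.
text = "a\u00acb".encode("utf-8")
print(back_up_to_lead(text, 2))  # 1
```

Note that the AC byte does show up in the real UTF-8 encoding, but only as a continuation after the C2 lead byte, which is exactly why it is rejected when it follows a complete single-byte character.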
