Re: Logical Nor (¬) in ASCII-based code pages?

Phil Smith III Mon, 08 May 2023 11:48:41 -0700

Seymour J Metz wrote, in part:
>You seem to be confirming what I wrote; if the locale is UTF-8 then
>your character data should be UTF-8. The ¬ character in UTF-8 has a
>different encoding from the ¬ character in Unicode, so there is no
>issue of a zero octet. '00AC'X is not a valid UTF-8 string.


There is no encoding in Unicode. That's the point, and is why you can't say 
"AC" and expect it to be meaningful. Folks might guess, but (especially with 
endianness) might get it wrong.
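
To make that concrete, here is a small Python sketch (the helper name 
encodings_of is mine, purely for illustration) that shows the one code point 
U+00AC coming out as different bytes depending on which encoding you pick:

```python
# One Unicode code point, several possible byte sequences.
# encodings_of is a hypothetical helper name, not a library function.

def encodings_of(ch: str) -> dict:
    """Return the bytes for ch under a few common encodings."""
    names = ("utf-8", "utf-16-be", "utf-16-le", "latin-1")
    return {name: ch.encode(name) for name in names}

NOT_SIGN = "\u00ac"  # the NOT SIGN character, U+00AC

for name, raw in encodings_of(NOT_SIGN).items():
    print(f"{name:10} {raw.hex().upper()}")
```

The code point is the same in every row; only the bytes differ, and the two 
UTF-16 rows differ from each other only in byte order.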

'00AC'X is indeed invalid as a UTF-8 encoded value, but it's not the 00 that's 
bad (that's fine, it's a null): it's the AC, which is invalid because the top 
two bits are 10 and that means it's a continuation byte. So a UTF-8 parser 
chugs along, sees the 00 and says "OK, good, that's a single-byte encoding of 
a null". But then it looks at the AC and says "Hey, this is supposed to be a 
continuation, and I'm not IN a multi-byte encoded character, that's no good". 
(One of the cool things about UTF-8 is that, assuming proper UTF-8, you can 
start in the middle of a string and, if you find you're at a continuation 
byte, you can back up to the first byte in the tuple and start from there!)

The above assumes big-endian, of course.
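
If it helps, the parser behavior described above can be sketched in Python. 
The function names here (is_continuation, char_start, first_bad_offset) are 
made up for illustration; the only library behavior relied on is that 
bytes.decode raises UnicodeDecodeError, whose start attribute is the offset of 
the offending byte.

```python
# A sketch of the two UTF-8 properties discussed above.
# All three function names are hypothetical, chosen for this example.

def is_continuation(byte: int) -> bool:
    """True if the top two bits are 10, i.e. a UTF-8 continuation byte."""
    return byte & 0xC0 == 0x80

def char_start(data: bytes, index: int) -> int:
    """Back up from index to the first byte of the character it sits in.

    Assumes data is well-formed UTF-8.
    """
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index

def first_bad_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

# The 00 is accepted as a null; the stray AC at offset 1 is rejected.
print(first_bad_offset(b"\x00\xac"))

# Landing on the continuation byte of a proper two-byte character
# and backing up to where that character starts.
data = "a\u00acb".encode("utf-8")  # 61 C2 AC 62
print(char_start(data, 2))
```

The byte string in the first call is the big-endian order, 00 then AC, to 
match the discussion.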

I'ma keep harping on "Unicode is not an encoding" because it isn't and it 
matters. Former manager beat that into my head, and it took a while for me to 
get it, so if you've bothered to read this far and feel like it's an 
artificial distinction, I dig. It's not.


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
