> “AC” is meaningless in a Unicode context.

In the context of a Unicode code point, "AC" is a perfectly unambiguous 
abbreviation for U+00AC. In any other context, not so much.

> This is especially confusing since “plain ol’ ASCII” maps directly to the 
> first part of UTF-8-encoded Unicode. 
> This is of course A Good Thing in general, but lets people cheat and get away 
> with it—until they don’t.

As long as they understand that ASCII is a 7-bit code, they're perfectly safe.
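
For example, in Python (purely as an illustration):

    # Pure 7-bit ASCII bytes mean the same thing under ASCII,
    # ISO 8859-1, and UTF-8, so code that sticks to them is safe.
    data = b"if a then not b"        # every byte is < 0x80
    assert data.decode("ascii") == data.decode("utf-8") == data.decode("latin-1")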

> I’ve seen customers take data that’s UTF-8 and think it’s 8859-1. 

Ouch! Are they the same users that think that, e.g., CP850 is ASCII?

> I’m more than willing to believe in some code page with hex AA as the NOT 
> sign, just never seen it.

See, e.g., <https://en.wikipedia.org/wiki/CP850#Character_set>.
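
A quick check with the CP850 codec that Python happens to bundle (illustrative only):

    # CP850 puts NOT SIGN (U+00AC) at hex AA; 8859-1 puts it at hex AC.
    assert bytes([0xAA]).decode("cp850") == "\u00ac"
    assert bytes([0xAC]).decode("latin-1") == "\u00ac"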


--
Shmuel (Seymour J.) Metz
http://mason.gmu.edu/~smetz3

________________________________________
From: IBM Mainframe Discussion List [IBM-MAIN@LISTSERV.UA.EDU] on behalf of 
Phil Smith III [li...@akphs.com]
Sent: Sunday, May 7, 2023 1:41 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Logical Nor (¬) in ASCII-based code pages?

Seymour J Metz wrote:
>I've seen Logical Not (¬) at AA and at AC. Are there any ASCII-based
>code pages that have it at a third position? Put another way, is there
>a third code point that ooRexx and Regina should recognize as ¬?

And later:
>UTF-8 is just a transform of Unicode, and the Unicode code point is
>AC. The string C2AC is just a way of encoding AC.

Not quite. Yes, hex C2AC is the UTF-8 encoding of the Unicode NOT sign. Unicode 
is a list of code points and, as you said, UTF-8 is an encoding. The Unicode 
code point is U+00AC. It is NOT “AC”, nor “hex AC”. Yes, I’m being picky, but 
this matters. The point is, U+00AC—the Unicode expression of that code 
point—has a specific meaning, which then *must* be encoded somehow (UTF-8, 
UTF-16, UTF-32); “AC” is meaningless in a Unicode context.
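
A concrete illustration, using Python's codecs (any language with real Unicode
support would show the same thing):

    >>> "\u00ac".encode("utf-8")        # the UTF-8 encoding of U+00AC
    b'\xc2\xac'
    >>> "\u00ac".encode("utf-16-be")    # same code point, UTF-16
    b'\x00\xac'
    >>> "\u00ac".encode("utf-32-be")    # same code point, UTF-32
    b'\x00\x00\x00\xac'

One code point, three different byte strings; a bare "AC" names none of them.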

This is especially confusing since “plain ol’ ASCII” maps directly to the first 
part of UTF-8-encoded Unicode. This is of course A Good Thing in general, but 
lets people cheat and get away with it—until they don’t.

It gets even more confusing because ISO 8859-1 *looks* like Unicode in that, 
for example, a hex AC is the NOT sign in 8859-1. But that’s 8859-1, not 
Unicode, not UTF-8. A hex AC *is not a character* in UTF-8: it’s an error. I’ve 
seen customers take data that’s UTF-8 and think it’s 8859-1. This mostly works. 
“Mostly” is not good.
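
Here's what that failure mode looks like in Python (a sketch, not anyone's
production code):

    >>> b"\xac".decode("utf-8")         # a lone hex AC is not valid UTF-8
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte
    >>> "\u00ac".encode("utf-8").decode("latin-1")   # UTF-8 data read as 8859-1
    'Â¬'

The second case is the "mostly works" trap: the ASCII text sails through
untouched, and the NOT sign quietly becomes two characters.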

As for your original question, I’m more than willing to believe in some code 
page with hex AA as the NOT sign, just never seen it. Hard to search for, too, 
alas. Do you know what page that is?

I’m a bit chary* of blindly accepting multiple code points as NOT signs. Better 
to know how your input is encoded (or mandate it). Unless, of course, it can be 
demonstrated that this particular multilingualism cannot be misinterpreted.
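
To make the ambiguity concrete, a hypothetical Python sketch (the code page
pairing is my choice, purely for illustration):

    # The same byte means different things depending on the code page,
    # so treating both AA and AC as NOT signs invites misreads.
    for b in (0xAA, 0xAC):
        print(f"{b:#04x}: cp850={bytes([b]).decode('cp850')!r} "
              f"8859-1={bytes([b]).decode('latin-1')!r}")
    # 0xaa: cp850='¬' 8859-1='ª'
    # 0xac: cp850='¼' 8859-1='¬'

Accept both blindly and an 8859-1 "ª" or a CP850 "¼" turns into a NOT operator.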

...phsiii

*no “char” pun intended


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
