On May 7, 2023, at 11:41:45, Phil Smith III wrote:
> ...
> This is especially confusing since “plain ol’ ASCII” maps directly to the
> first part of UTF-8-encoded Unicode. This is of course A Good Thing in
> general, but lets people cheat and get away with it—until they don’t.
>
Yup. On macOS, sed's regex counts UTF-8 characters, while printf's field width counts octets:
516 $ printf '%3s|\n%3s|\n' 2π r
2π|
  r|
517 $
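The same mismatch can be reproduced outside the shell. Here is a minimal Python sketch, not a description of what printf or sed do internally; the helper names `pad_by_octets` and `pad_by_chars` are made up for illustration. It right-justifies a string to a width counted either in UTF-8 octets or in characters:

```python
def pad_by_octets(s: str, width: int) -> str:
    """Right-justify s, measuring it by its UTF-8 encoded length (octets)."""
    octets = len(s.encode("utf-8"))
    return " " * max(0, width - octets) + s


def pad_by_chars(s: str, width: int) -> str:
    """Right-justify s, measuring it in code points (characters)."""
    return s.rjust(width)
```

With a width of 3, `pad_by_octets("2π", 3)` adds no padding because π already takes two octets, which matches the printf output above, while `pad_by_chars("2π", 3)` adds one leading space.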
> As for your original question, I’m more than willing to believe in some code
> page with hex AA as the NOT sign, just never seen it. Hard to search for,
> too, alas. Do you know what page that is?
Host: UTF-8 output: CP852
       0  16  32  48  64  80  96 112 128 144 160 176 192 208 224 240
       0  10  20  30  40  50  60  70  80  90  A0  B0  C0  D0  E0  F0
 0 0               0   @   P   `   p           á   ░   └   đ   Ó
 1 1           !   1   A   Q   a   q           í   ▒   ┴   Đ   ß   ˝
 2 2           "   2   B   R   b   r           ó   ▓   ┬   Ď   Ô   ˛
 3 3           #   3   C   S   c   s           ú   │   ├   Ë   Ń   ˇ
 4 4           $   4   D   T   d   t           Ą   ┤   ─   ď   ń   ˘
 5 5           %   5   E   U   e   u           ą   Á   ┼   Ň   ň   §
 6 6           &   6   F   V   f   v           Ž   Â   Ă   Í   Š   ÷
 7 7           '   7   G   W   g   w           ž   Ě   ă   Î   š   ¸
 8 8           (   8   H   X   h   x           Ę   Ş   ╚   ě   Ŕ   °
 9 9           )   9   I   Y   i   y           ę   ╣   ╔   ┘   Ú   ¨
10 A           *   :   J   Z   j   z           ¬   ║   ╩   ┌   ŕ   ˙
11 B           +   ;   K   [   k   {           ź   ╗   ╦   █   Ű   ű
12 C           ,   <   L   \   l   |           Č   ╝   ╠   ▄   ý   Ř
13 D           -   =   M   ]   m   }               Ż   ═   Ţ   Ý   ř
14 E           .   >   N   ^   n   ~           «   ż   ╬   Ů   ţ   ■
15 F           /   ?   O   _   o               »   ┐   ¤   ▀   ´
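Since this is hard to search for, a rough Python sketch for checking which pages put the NOT sign at a given byte. It assumes the codec names bundled with Python; the function name `pages_with_not_sign_at` and the candidate list are my own choices for illustration, not an exhaustive survey of code pages.

```python
NOT_SIGN = "\u00ac"  # U+00AC NOT SIGN

# Candidate list is an assumption for illustration; extend as needed.
CANDIDATES = ("cp852", "cp437", "cp850", "latin-1", "cp1252", "cp037")


def pages_with_not_sign_at(byte_value: int, candidates=CANDIDATES) -> list:
    """Return the candidate code pages that decode byte_value to U+00AC."""
    hits = []
    for name in candidates:
        try:
            decoded = bytes([byte_value]).decode(name)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this code page
        if decoded == NOT_SIGN:
            hits.append(name)
    return hits
```

Calling `pages_with_not_sign_at(0xAA)` should list cp852 among the hits, in line with row A of the table above, while `pages_with_not_sign_at(0xAC)` covers the Latin-1 position.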
> I’m a bit chary* of blindly accepting multiple code points as NOT signs.
> Better to know how your input is encoded (or mandate it). Unless, of course,
> it can be demonstrated that this particular multilingualism cannot be
> misinterpreted.
+1
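To illustrate the risk, a small Python sketch that decodes the same two octets under two declared encodings. The helper `has_not_sign` is hypothetical; the codec names are Python's own.

```python
NOT_SIGN = "\u00ac"  # U+00AC NOT SIGN


def has_not_sign(raw: bytes, encoding: str) -> bool:
    """Decode raw with the declared encoding, then look for U+00AC."""
    return NOT_SIGN in raw.decode(encoding)


# The same two octets mean different things depending on the declared page.
as_cp852 = b"\xaa\xac".decode("cp852")     # NOT SIGN, then C with caron
as_latin1 = b"\xaa\xac".decode("latin-1")  # feminine ordinal, then NOT SIGN
```

Accepting both X'AA' and X'AC' as NOT without knowing the page would turn a CP852 Č or a Latin-1 ª into a NOT sign, which is the misinterpretation to rule out.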
--
gil