Charles: Good points, well made. Yes, I agree that UTF-16 offers no advantage to me. UTF-32 has to be considered for performance in string-handling functions. I may end up defaulting to UTF-8 on disc and converting to the others when needed.
The system's source and compiler (crude but working) are all written in 7-bit ASCII to keep things simple, but data can be any value -- I'm not a big fan of stringz :-)

To be frank, I doubt it will have more than one user, but I won't be happy until I can write and print my CV on it, so I might as well make some sensible decisions now :-)

Thank you,
Rupert

On Thu., Aug. 20, 2020, 15:17 Charles Mills, <[email protected]> wrote:

> Not exactly the question you asked, but IMHO if one were writing a
> "system" (OS, DBMS, application family) today one would be foolish to
> restrict one's customers to 95 or so printable characters. You would be
> (1) writing off all of Asia and (2) condemning much of Europe and northern
> Africa to either second-class status or the constant code-page shuffle
> ("what character is x'80'? Well, it depends where you are.")
>
> Other than the above you have three choices:
>
> - UTF-8, which will represent every character in the world, is almost as
> compact as ASCII, and can be treated as ASCII for quick-and-dirty purposes
> like debugging displays. What you give up is the comforting knowledge that
> characters are always, always, always one-to-one with bytes.
>
> - UTF-32. Like UTF-8, but you gain a fixed relationship between characters
> and bytes (1:4) at a cost in storage. You might counter that storage is
> cheap these days.
>
> - I am not a Windows-basher, but I think Windows' choice of UTF-16 is the
> worst of both worlds. It consumes twice the storage of ASCII, with the
> tradeoff that you can almost, almost, almost count on a fixed relationship
> between characters and bytes (1:2). The problem is that you cannot quite
> count on it -- some characters are 32 bits -- and if you have supported
> code that is running out in the field, you know that code that works 99.9%
> of the time is much more problematic than code that works 95% of the time
> (as would a routine that assumed UTF-8 was 1:1 with bytes).
> Most Web pages, the Go language, and Db2 (I am told) all use UTF-8
> internally.
>
> Charles
>
>
> -----Original Message-----
> From: IBM Mainframe Discussion List [mailto:[email protected]] On
> Behalf Of Rupert Reynolds
> Sent: Thursday, August 20, 2020 5:55 AM
> To: [email protected]
> Subject: EBCDIC and other systems
>
> I'm writing a new OS for PC hardware (an exercise started during
> lockdown/furlough) and I wondered about files from other systems. Is there
> much in DBCS on mainframe systems these days, or is it still mainly the
> same old 8-bit EBCDIC, please?
>
> I still have to decide whether to support UTF-8 and/or UTF-32, of course
> :-)
>
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to [email protected] with the message: INFO IBM-MAIN
> ----------------------------------------------------------------------
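[Editor's note: the storage and code-unit tradeoffs Charles describes can be made concrete with a short sketch. Python is used here purely for illustration -- the hypothetical OS in this thread would implement its own handling -- and the sample characters are arbitrary choices. U+1D11E (musical G clef) lies outside the Basic Multilingual Plane, so it is the case where UTF-16's "almost 1:2" relationship breaks down and a surrogate pair is needed.]

```python
# Compare the encoded size, in bytes, of sample characters across the three
# Unicode encoding forms discussed in the thread.
samples = {
    "A": "\u0041",           # 7-bit ASCII letter
    "e-acute": "\u00e9",     # Latin range, 2 bytes in UTF-8
    "euro": "\u20ac",        # BMP character, 3 bytes in UTF-8
    "G clef": "\U0001d11e",  # supplementary plane: UTF-16 surrogate pair
}

for name, ch in samples.items():
    utf8 = len(ch.encode("utf-8"))
    utf16 = len(ch.encode("utf-16-le"))  # the -le variant omits the BOM
    utf32 = len(ch.encode("utf-32-le"))
    print(f"{name:8} U+{ord(ch):05X}  UTF-8: {utf8}  UTF-16: {utf16}  UTF-32: {utf32}")
```

Running this shows UTF-8 varying from 1 to 4 bytes per character, UTF-32 fixed at 4, and UTF-16 at 2 bytes for everything except the G clef, which takes 4 -- exactly the 99.9%-vs-95% trap described above: code assuming UTF-16 is 1:2 with bytes works until it meets a supplementary-plane character.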
