Re: Running USS with locale other than 1047

Rick Troth Fri, 14 Oct 2016 13:25:49 -0700

Some history, and some hope.


On 10/13/16 16:24, Paul Gilmartin wrote:

Hmmm... You asked about Danish,
but your Mail Agent seems to be speaking Finnish.

:-)

The advantage in the non-EBCDIC* world is that the lower half of 8-bit space is rather more consistent. And that space is where we have some serious trouble on this side of the line (pipe symbol versus exclamation, square brackets, curly braces).

Years ago, Edwin Hart (then at JHU APL) and others worked through SHARE to normalize EBCDIC into a code page which could be translated to/from non-EBCDIC* consistently and reliably. We've discussed it in the lists/fora, perhaps this particular list/forum, even recently. (I've slept since then.) The result of the SHARE effort was what some call "Code Page 37 version 2". IBM never fully took up the customer-produced code page, but they did listen and they gave us CP 1047.

Outside of IBM, most have an affinity for a _one-to-one reversible mapping_ which treats the EBCDIC side as CP37V2 and the non-EBCDIC* side as ISO-8859-1. This doesn't help the Poles, I suppose. (It would have been nice if IBM had a Polish code page which could use the /same translate table/ and match up with a Polish non-EBCDIC code page.)
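To make "one-to-one reversible" concrete, here is a minimal sketch in Python. It is an illustration only: Python ships no CP37V2 or CP 1047 codec that I know of, so the table below is built from Python's stock "cp037" codec as a stand-in, and the names to_latin1, to_ebcdic, and is_one_to_one are made up for this example. The point is the property, not the particular table: all 256 byte values are used exactly once, so translating out and back loses nothing.

```python
# Sketch only: Python's stock "cp037" codec stands in for CP37V2 here.
# It is NOT the SHARE table and NOT the Dignus table.

# Forward table: EBCDIC byte value -> ISO-8859-1 byte value.
to_latin1 = bytes(range(256)).decode("cp037").encode("latin-1")

# Inverse table: ISO-8859-1 byte value -> EBCDIC byte value.
_inverse = bytearray(256)
for ebcdic_byte, latin1_byte in enumerate(to_latin1):
    _inverse[latin1_byte] = ebcdic_byte
to_ebcdic = bytes(_inverse)


def is_one_to_one(table: bytes) -> bool:
    """True if the 256-entry table uses every byte value exactly once."""
    return len(table) == 256 and len(set(table)) == 256


if __name__ == "__main__":
    sample = bytes(range(256))
    there = sample.translate(to_latin1)
    back = there.translate(to_ebcdic)
    print("one-to-one:", is_one_to_one(to_latin1))
    print("round trip intact:", back == sample)
```

Because the same 256-byte table works on any byte, including control characters, nothing needs to be known about the content before translating, and nothing is lost if the guess turns out wrong.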

Witness Dignus: aside from newline (see below) their default /translation is the same/ as that gleaned from this two-decades-old SHARE effort. Nice work. Good job.

CP 1047 is the best we have, if we are to live in the world IBM has created for us. (And some people accept the "CP1047" tag even though they're really talking CP37V2.)

Sadly, CP 1047 doesn't help the Poles (nor the Danes, nor the Finns).
But now it appears we can change locale. Fabulous!

Thankfully, locale variables (LANG, LC_CTYPE, et al) are indicated using an even smaller subset of EBCDIC than those code points which map from "low order non-EBCDIC".

There is still the problem that a stream of bytes might not be recognized. Tagging files with charset ABC or code page 123 is clumsy at best.


*Here's hope:*

Newline is always non-printable whether EBCDIC or non-EBCDIC*.

Given a stream of bytes of unknown meaning (but reasonably expecting "plain text") one can trigger on 0x15 and be reasonably sure the preceding is EBCDIC, or trigger on 0x0A and be reasonably sure the preceding is not. (And one can strip off or append 0x0D as needed.)

If the content is a shell script, locale variables can be recognized and respected. XML, HTML, and source code can trivially include reliable cues to the proper locale for rendering.

Again, for a byte stream text file, look for EBCDIC "NL" newline or look for non-EBCDIC "LF" linefeed. EBCDIC NL will never appear in non-EBCDIC printable plain text. Non-EBCDIC LF will never appear in EBCDIC printable plain text. It's a good test.
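The test described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a finished tool: the function names guess_family and split_lines are invented here, the input is assumed to be plain text, and a stream containing both 0x15 and 0x0A (or neither) is simply reported as "unknown" rather than guessed at.

```python
# Sketch of the newline test: 0x15 suggests EBCDIC, 0x0A suggests not.
# Names and the "unknown" fallback are assumptions of this example.

EBCDIC_NL = 0x15   # EBCDIC "NL" newline
ASCII_LF = 0x0A    # non-EBCDIC "LF" linefeed
CR = 0x0D          # carriage return, same value on both sides


def guess_family(data: bytes) -> str:
    """Return "ebcdic", "non-ebcdic", or "unknown" for a plain-text stream."""
    has_nl = EBCDIC_NL in data
    has_lf = ASCII_LF in data
    if has_nl and not has_lf:
        return "ebcdic"
    if has_lf and not has_nl:
        return "non-ebcdic"
    return "unknown"  # both or neither seen: the test says nothing


def split_lines(data: bytes):
    """Split on the detected line end and strip a trailing 0x0D per line."""
    family = guess_family(data)
    if family == "unknown":
        return family, [data]
    sep = bytes([EBCDIC_NL if family == "ebcdic" else ASCII_LF])
    lines = data.split(sep)
    if lines and lines[-1] == b"":
        lines.pop()  # the stream ended with a line terminator
    stripped = [ln[:-1] if ln.endswith(bytes([CR])) else ln for ln in lines]
    return family, stripped


if __name__ == "__main__":
    print(guess_family(b"\xC8\xC5\xD3\xD3\xD6\x15"))  # "HELLO" + NL in EBCDIC
    print(guess_family(b"hello\r\n"))
```

Only after this check would one pick a code page or locale and convert the lines; the check itself needs no knowledge of either.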

This is where even Dignus doesn't quite get it: They translate EBCDIC 0x15 to non-EBCDIC 0x0A. (Actual non-EBCDIC for "newline" is 0x85.) But their table only helps with the above test, and _makes sense_ for cases where someone did an un-measured translation. So I can't fault them.

Once the result of the EBCDIC (or not) check is known, one can apply locale and "convert" appropriately, i.e., beyond the cramped walls of 8-bit space.


-- R; <><

* I say non-EBCDIC here because "ASCII" has baggage for many. Y'all know what I mean.








