Some history, and some hope.

On 10/13/16 16:24, Paul Gilmartin wrote:
Hmmm... You asked about Danish,
but your Mail Agent seems to be speaking Finnish.


The advantage in the non-EBCDIC* world is that the lower half of 8-bit space is rather more consistent. And that space is where we have some serious trouble on this side of the line (pipe symbol versus exclamation, square brackets, curly braces).

Years ago, Edwin Hart (then at JHU APL) and others worked through SHARE to normalize EBCDIC into a code page which could be translated to/from non-EBCDIC* consistently and reliably. We've discussed it in the lists/fora, perhaps this particular list/forum, even recently. (I've slept since then.) The result of the SHARE effort was what some call "Code Page 37 version 2". IBM never fully took up the customer-produced code page, but they did listen and they gave us CP 1047.

Outside of IBM, most have an affinity for a _one-to-one reversible mapping_ which treats the EBCDIC side as CP37V2 and the non-EBCDIC* side as ISO-8859-1. This doesn't help the Poles, I suppose. (It would have been nice if IBM had a Polish code page which could use the /same translate table/ and match up with a Polish non-EBCDIC code page.)
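
To make "one-to-one reversible" concrete, here is a minimal sketch that checks the property for all 256 byte values. It assumes Python's built-in "cp037" codec as a stand-in for the EBCDIC side; that codec is the original CP 037, not the SHARE "version 2" table, and the function names are mine.

```python
# Assumption: the stock "cp037" codec stands in for the EBCDIC side.
# It is NOT the SHARE CP37V2 table; it only illustrates the property.
ALL_BYTES = bytes(range(256))


def ebcdic_to_latin1(data: bytes) -> bytes:
    """Translate EBCDIC bytes to ISO-8859-1 bytes, one byte for one byte."""
    return data.decode("cp037").encode("latin-1")


def latin1_to_ebcdic(data: bytes) -> bytes:
    """Translate ISO-8859-1 bytes back to EBCDIC bytes."""
    return data.decode("latin-1").encode("cp037")


translated = ebcdic_to_latin1(ALL_BYTES)

# One-to-one: no two EBCDIC code points land on the same target byte.
is_one_to_one = len(set(translated)) == 256

# Reversible: translating back recovers the original bytes exactly.
round_trips = latin1_to_ebcdic(translated) == ALL_BYTES
```

The same two checks would apply to any candidate table, including a Polish pairing, if both code pages cover the same set of characters.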

Witness Dignus: aside from newline (see below) their default /translation is the same/ as that gleaned from this two-decades-old SHARE effort. Nice work. Good job.

CP 1047 is the best we have, if we are to live in the world IBM has created for us. (And some people accept the "CP1047" tag even though they're really talking CP37V2.)
Sadly, CP 1047 doesn't help the Poles (nor the Danes, nor the Finns).
But now it appears we can change locale. Fabulous!

Thankfully locale variables (LANG, LC_CTYPE, et al) are indicated using an even smaller subset of EBCDIC than those code points which map from "low order non-EBCDIC".

There is still the problem that a stream of bytes might not be recognized. Tagging files with charset ABC or code page 123 is clumsy at best.

*Here's hope:*

Newline is always non-printable whether EBCDIC or non-EBCDIC*.
Given a stream of bytes of unknown meaning (but reasonably expecting "plain text") one can trigger on 0x15 and be reasonably sure the preceding is EBCDIC or trigger on 0x0A and be reasonably sure the preceding is not. (And one can strip off or append 0x0D as needed.)
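
A rough sketch of that trigger, assuming the input really is plain text. The names guess_family and split_lines are hypothetical, and comparing counts is my own tie-breaker rather than anything standardized.

```python
EBCDIC_NL = 0x15  # EBCDIC newline
LF = 0x0A         # non-EBCDIC linefeed
CR = b"\x0d"      # carriage return, same value on both sides


def guess_family(data: bytes) -> str:
    """Guess 'ebcdic', 'non-ebcdic', or 'unknown' from newline bytes alone."""
    nl_count = data.count(EBCDIC_NL)
    lf_count = data.count(LF)
    if nl_count > lf_count:
        return "ebcdic"
    if lf_count > nl_count:
        return "non-ebcdic"
    return "unknown"  # no newline of either kind, or a tie


def split_lines(data: bytes) -> list:
    """Split on the detected newline and strip any trailing 0x0D."""
    family = guess_family(data)
    if family == "unknown":
        return [data]
    separator = bytes([EBCDIC_NL if family == "ebcdic" else LF])
    return [line.rstrip(CR) for line in data.split(separator)]
```

The test only looks at two byte values, so it works before any code page or locale has been chosen.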

If the content is a shell script, locale variables can be recognized and respected. XML, HTML, and source code can trivially include reliable cues to the proper locale for rendering.

Again, for a byte stream text file, look for EBCDIC "NL" newline or look for non-EBCDIC "LF" linefeed. EBCDIC NL will never appear in non-EBCDIC printable plain text. Non-EBCDIC LF will never appear in EBCDIC printable plain text. It's a good test.

This is where even Dignus doesn't quite get it: They translate EBCDIC 0x15 to non-EBCDIC 0x0A. (Actual non-EBCDIC for "newline" is 0x85.) But their table only helps with the above test, and _makes sense_ for cases where someone did an un-measured translation. So I can't fault them.

Once the result of the EBCDIC (or not) check is known, one can apply locale and "convert" appropriately, i.e., beyond the cramped walls of 8-bit space.
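
As a sketch of that last step: check the newline bytes, then decode into Unicode text. The function name and its two codec parameters are hypothetical; the defaults ("cp037" and "latin-1") are only placeholders for whatever the locale actually calls for, and Python's standard library does not ship a CP 1047 codec.

```python
def to_text(data: bytes,
            ebcdic_codec: str = "cp037",
            other_codec: str = "latin-1") -> str:
    """Decode a plain-text byte stream after the NL-versus-LF check.

    Assumption: the codec arguments would be chosen from the locale;
    the defaults here are stand-ins, not a recommendation.
    """
    if data.count(0x15) > data.count(0x0A):
        text = data.decode(ebcdic_codec)
        # The stock cp037 codec decodes 0x15 to U+0085 (NEL);
        # normalize it to the usual "\n" for downstream tools.
        return text.replace("\x85", "\n")
    return data.decode(other_codec)
```

For example, to_text(data, other_codec="iso8859_2") would be one way to handle a Polish non-EBCDIC file, given that the locale has been established some other way.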

-- R; <><

* I say non-EBCDIC here because "ASCII" has baggage for many. Y'all know what I mean.

