On Mon, 2015 Apr 27 22:12+0000, Thorsten Glaser wrote:
> Hah!
>
> Hi again.
>
> Your eMail requires at least three passes…
>
> ① reading through all of it, taking notes
> ② this answer message, with a few comments on some things, while
> ignoring some other things altogether
> ③ another answer message tackling those things, after I ponder this
> some more (it *is* a brave new world you opened!)
>
> The result of #1+#2 follows.
Ready for it!
> Daniel Richard G. dixit:
>
> >> - what about \u20AC? UTF-8? UTF-EBCDIC?
> >
> >Many code pages have received euro-sign updates; e.g. EBCDIC 924 is
>
> I wasn’t actually asking about Euro support here, but deeper…
I'm not sure I understand what you're getting at... U+20AC is the
Euro sign...
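If the question is about byte-level representations, here's the concrete
difference, for what it's worth (the IBM-1140 position is from memory, so
treat that as an assumption):

```shell
# U+20AC has a different representation in each encoding mentioned:
# UTF-8 encodes it as the three bytes E2 82 AC; euro-updated EBCDIC
# code pages give it a single byte (0x9F in IBM-1140, from memory);
# plain Latin-1 has no slot for it at all.
printf '\342\202\254\n' | od -An -tx1
```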
> >I worked by way of SSH'ing in to the z/OS OMVS Unix environment.
> >Everything in OMVS is EBCDIC, but of course my SSH client sends and
> >receives everything in ASCII. There is a network translation layer in
> >between, apart from the file-content conversion layers previously
> >mentioned, that makes it all work transparently.
>
> … *UGH!* That’s the hard thing.
It's either that, or x3270 :]
> Actually, does your SSH client send/receive in ASCII, or in latin1 or
> some other ASCII-based codepage? What does this layer use?
I'm working from a system with a UTF-8 locale, but as I'm US-based,
pretty much everything is ASCII. The conversion layer, however,
explicitly uses ISO 8859-1 on the client side. If I send actual UTF-8,
that would probably get interpreted as so much Latin-1.
> >Even IBM's z/OS Unicode support via UTF-16 is, as far as I can tell,
> >for use by applications and not by logged-in users.
>
> OK.
Of course, I see no reason why mksh couldn't use this Unicode support,
as long as it continues talking ASCII/EBCDIC with the terminal.
> This would mean completely removing utf8-mode from the shell. That’s a
> deeper incision than I originally thought would be required.
Removing it? I thought off-by-default would be enough...
> z/Linux is “something like Debian/s390 and Debian/s390x”, then?
> (In that case: mksh works perfectly well there.)
Yes, exactly; z/Linux is just how I've heard it referred to in my
company. That environment is pretty trivial to port to, as It's Just
Linux(tm) with slightly different sysdeps.
> >Perhaps the "printf '\x4F'" thing can be used to detect an
> >EBCDIC build
>
> No, printf is unportable, but maybe echo something | tr a-z A-Z, which
> should differ. Though I recall at least one system not supporting
> ranges in tr, so this is more like a “check if the output is expected
> for tr on EBCDIC that does support ranges, and everything else is
> ASCII” thing, I guess.
Even if printf is unportable, the test need only succeed on EBCDIC
platforms. Instead of checking for 'O' vs. '|', check for '|' vs.
anything else (including error).
You won't get anywhere with tr(1) in EBCDIC-land, I'm afraid:
$ echo hijk | tr a-z A-Z
HIJK
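Something along these lines is what I had in mind for Build.sh (a sketch
only; byte 0x4F is '|' in the common EBCDIC code pages and 'O' in ASCII):

```shell
# Probe sketch: print byte 0x4F and see what comes back.  Anything
# other than '|' -- including a printf that doesn't understand \x
# escapes, or errors out entirely -- counts as not-EBCDIC.
case "$(printf '\x4F' 2>/dev/null)" in
'|') ebcdic=1 ;;
*)   ebcdic=0 ;;
esac
echo "ebcdic=$ebcdic"
```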
> >I don't think it's feasible to have a single mksh binary support
> >multiple EBCDIC variants, however, so IMO this matter is best left to
> >the user's discretion in what CFLAGS they provide (-qconvlit option).
> >As long as the code specifies these characters literally instead of
> >numerically, everything should fall in line.
>
> … sounds like a maintenance nightmare. But probably doable, if we
> enumerate the set of options (to a carefully chosen, small number).
I would just have a small platform note in the documentation that calls
the user's attention to xlc's -qascii and -qconvlit options, with a
brief discussion of the ASCII vs. EBCDIC issues, and then let them
decide how to deal with it.
> >The Build.sh code wouldn't be able to suss out the signals any better if
> >it knew about these that are unique to z/OS? IBM might add even more
> >signals down the line, after all...
>
> I don’t think so, at least NSIG should be precise, especially
> if at least one of sys_siglist, sys_signame and strsignal exists.
Pretty sure none of those are available :( They're certainly not in
the headers.
> You could experiment things at runtime. Just kill(2) something
> with all numbers, see if high numbers give different errors,
> maybe the OS says “signal number too high”, then we get a clue.
$ kill -120 83953851
kill: FSUM7327 signal number 120 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -64 83953851
kill: FSUM7327 signal number 64 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -63 83953851
kill: FSUM7327 signal number 63 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -62 83953851
kill: FSUM7327 signal number 62 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -40 83953851
kill: FSUM7327 signal number 40 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -39 83953851
kill: FSUM7327 signal number 39 not conventional
$ kill -38 83953851
kill: FSUM7327 signal number 38 not conventional
$ kill -37 83953851
kill: FSUM7327 signal number 37 not conventional
For what it's worth...
> > "rlimits.gen", line 20.20: 1506-191 (E) The character # is not a
> > valid C source character.
>
> May be related to a bug in the shell running Build.sh. Try removing
> all backslash+newline from the *.opt files first, as I do in the mksh-
> R*.tgz distfiles. (read without -r is supposed to do that, but
> apparently, enough shells fail it; mid- to long-term, there is no
> way around compiling a C tool on the host to generate some things
> for mksh ☹)
Yes, rlimits.gen is lacking the continuation backslashes from
rlimits.opt. Guess those are getting dropped somewhere.
Once I flattened each of those definitions into a single line, the build
proceeded and completed without error, and the test suite...
Total failed: 0
Total passed: 498
I wouldn't encourage a host-side C tool here, as that was partly what
made a GNU Bash build unmanageable on this system...
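For reference, the read behavior Thorsten mentions is easy to see; this is
presumably the mechanism eating the continuations somewhere in the pipeline:

```shell
# Plain `read` treats backslash-newline as a line continuation and
# drops it; `read -r` keeps the backslash literally.
read joined <<'EOF'
foo \
bar
EOF
echo "read:    [$joined]"

read -r literal <<'EOF'
foo \
bar
EOF
echo "read -r: [$literal]"
```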
> > $ ./mksh -c 'echo zzz'; echo
> > :::
>
> … catching the thread from above again… this sounds like we’d want to
> completely ignore it…
>
> >I'm looking into setting up an environment that is all-ASCII,
> >starting from /bin/sh (hence why I'm here)
>
> … except for this thing.
>
> I’d hope ASCII mksh would work as-is on your system, but apparently
> you’d need a bunch of tools working in ASCII mode first before you can
> confirm that, e.g. by running the testsuite ;)
>
> Sounds like an interesting (positive sense) hacking goal! You, Sir,
> fit right into the mksh crowd (I did the Plan 9 port, as well as
> others, but lewellyn and RT from IRC have ported mksh to many funny
> and obscure platforms)!
Oh, you're very kind :)
Yes, that's the general idea... after /bin/sh, build awk, grep, sed
et al. Bootstrap a whole ASCII ecosystem from EBCDIC, and then build
more complex projects inside that ASCII environment. The biggest
challenge so far is just getting that EBCDIC<->ASCII auto-conversion
switched off, so that the ASCII doesn't get mangled.
> >> Huh, Pascal anyone? :)
> >
> >I figured, Perl and Python have picked it up, why not here too? ;)
>
> I was always more partial to ASC() and CHR$()… *ducks and hides*
chr() is there too! Not asc(), however... then you might have to have
ebc(), too ^_^
> >But that need not be active, right? I saw UTFMODE in run-time
> >conditionals all over the place.
>
> At the current point in time, yes.
>
> I hope to be able to make the entire of edit.c, plus good parts of
> lex.c and syn.c and some parts of tree.c use 16-bit Unicode
> internally.
I'm presuming this would be wchar_t and its related functions?
> On the other hand, we’d just need a different 8-to-16-bit conversion
> function, and back, for EBCDIC. (The bad part about *this* is that it
> messes with supporting various EBCDIC variants. The good part is that
> we possibly could use system library code for this, which we cannot on
> the POSIX/ASCII world.)
Is the idea along the lines of filtering everything through iconv(3),
going from UCS-2/UTF-16 internally to whatever encoding the scripts and
terminal are in? So the code only deals with Unicode and translates
appropriately to the outside world?
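i.e., something like this round trip, with iconv(1) standing in for the
library calls (assumes an iconv that knows the IBM1047 code page, as
glibc's does):

```shell
# ASCII -> EBCDIC (IBM-1047) -> ASCII round trip; the middle stage is
# what the shell would hand to an EBCDIC terminal layer.
printf 'hello' | iconv -f ASCII -t IBM1047 | od -An -tx1
printf 'hello' | iconv -f ASCII -t IBM1047 | iconv -f IBM1047 -t ASCII
echo
```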
> I’ll do my part too and review parts of mksh for gems like those,
> post-R51 though.
>
> >practical concern... I mean, if you could support MacOS 9 with its
> >CR-only line endings just by tweaking a few lines of code, surely
> >that would be desirable, even if it makes an exception to the
> >"consistent across all platforms" property? In reality, it would
> >require a lot more invasive changes (not least due to the lack of
> >POSIX), and I'd anticipate _that_ to be the main objection to
> >integrating such support.
>
> Absolutely not! A port of mksh to MacRelix is underway. I just require
> it to use Unix line endings ;) (otherwise, we’d have another fork on
> our hands, but lewellyn seems to agree)
Well, at least MacRelix aims to be a Unix/POSIX environment, so Unix
line endings make sense there.
> MacRelix is something like Cygwin for System 7, AIUI.
I wasn't aware of this project until now. Good on them!
> >That not only makes the code EBCDIC-compatible, it arguably makes
> >it clearer. Not a separate section of code prone to bit rot, but
> >simply a particular discipline applied to the common/platform-
> >independent code.
>
> Meh. I grew up with ASM (besides GW-BASIC). And some mksh targets
> suffer from stupid compilers, like pcc; there, this does make a
> difference. But, being honest with myself, other things would make so
> much more of a difference to completely hide this… thinking of an
> optimised rotation implementation (we’re going to need this for hash
> table lookups, which will only become more) for example… which is not
> going to happen, as it’d involve writing asm code for every target
> that can’t use the common C part, and I don’t want to go down that
> direction for mksh.
I'm not sure I understand the point you're making here :> But you're
not saying that you'd need to resort to assembly for an efficient hash-
table implementation... right?
> So, I don’t quite agree with the reasoning, but I fully agree with the
> direction this takes. (I had actually thought about this very line,
> and wondered if I could make this into a macro… a bit more obscure,
> but keeping the ASCII version more compact; I like small code.)
If there were numerous case-insensitive single-char comparisons like
this, then a macro would make sense, but as far as I saw there was
only the one.
> Without going further into this, let me throw another idea into
> the room:
>
> a separate repository, with those things that would make the ifdefs
> too much, in which I’d merge each release, possibly even more often,
> from which EBCDIC releases are cut. Could even be git, in case people
> like that more; the cvs to git conversion appears to be pretty stable.
I didn't think the #ifdefs got out of hand... was there any place in
particular where you saw this to be the case? Or are you thinking of
sections of code, yet to be made EBCDIC-compatible, that are likely to
become #ifdef-fests as a result?
Early on, I thought a separate ebcdic.c source file might have been
warranted. But the amount of new code needed was fairly small, and the
overhead of conditionally compiling such a file would have outweighed
the benefit of keeping the EBCDIC-related code separate.
> >Fair enough; I think that the changes needed to support EBCDIC would
> >weigh in a lot lighter than those needed to support Windows natively.
> >The same POSIX API is used, after all---the worst of it is the Ctrl-key
> >mapping tables.
>
> Yes and no. Michael managed to hide much in a library.
Yet I presume you would not want to integrate such a library into the
main source tree (e.g. under a win32/ directory), irrespective of the
CR+LF/fork() issue, as you wouldn't/couldn't want to maintain such a
library yourself...?
> >Will mksh continue to support ISO 8859-1 (Latin-1) environments
>
> mksh has never supported latin-1 (or any 8-bit codepage/SBCS, or DBCS)
> environments, period.
>
> mksh is always: ASCII, possibly UTF-8/CESU-8, but 8-bit transparent.
I thought a number of older Unix environments still used ISO 8859
encodings, as Linux once did. You're saying, it doesn't work at all, or
there's just no first-class support for it? Especially if it's 8-bit
transparent, that sounds like high-bit characters would at least pass
through safely.
(I'd understand a restriction like "no accented letters in variable
names," but that's generally not the concern when it comes to supporting
an encoding)
> I think I’ll eventually require 8<->16-bit conversion routines. I
> currently supply them in expr.c (“UTF-8 support code: low-level
> functions”) for the ASCII/UTF-8 case.
>
> If all used codepages have a mapping for all possible octets,
> and there are system functions we can use for this, we probably
> should do so.
Not every byte will necessarily have a mapping (e.g. Latin-1
[\x80-\x9F]), but at least having a well-defined behavior for these
would be good.
You would continue to provide these routines as a fallback, however,
right? For systems that don't have them?
> This is, however, strengthening my (tentative) resolution to make this
> into a separate product. This removes certain promises the shell
> offers to scripts that they can rely on, and a lot of functionality.
You would want to guarantee that e.g. "printf '\101'" produces
(ASCII) 'A'?
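To make that concrete (hedging only on what an EBCDIC build would actually
do, since that's the open question):

```shell
# The promise as it stands on ASCII/UTF-8 systems: numeric escapes in
# printf name bytes, so octal 101 (0x41) is always 'A', and a leading
# quote in %d yields the code of the character -- 65 here, but 193
# ('A' in EBCDIC 1047) on an unconverted EBCDIC build.
printf '\101\n'
printf '%d\n' "'A"
```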
> >Well... you really think the changes are extensive enough to warrant
> >a fork?
>
> In some cases, I’d wish for it to be a fork, and not shipped with the
> main code, for a one-liner. Opening files as O_TEXT instead of
> O_BINARY on Win32, for example.
>
> Deciding point here is the API exposed to the scripts (or the
> interactive user), really.
I do understand the desire to draw a boundary, Venn-diagram-like, inside
of which you have "standard" mksh, with LF line endings, O_BINARY, and
the API and runtime environment promised to scripts and the user.
Outside of that would live "nonstandard mksh," with variations that may
better suit a given platform but stray from the mksh Platonic ideal.
However, I'm not sure I see the value of making the official source tree
align exactly with the boundaries of "standard mksh." If you have a one-
line change that doesn't fit the standard, you'll be making a lot more
work for yourself by keeping that as a separate patch, be it in the way
of testing, merging, updating/syncing, or even just letting users know
it exists. But if it is integrated as a compile-time option with strong
caveats about the non-standardness, and an appropriately different
KSH_VERSION, then that maintains the distinction and puts scripts/users
on notice about the different promises that are being made.
> I’m a bit unfair here, because lksh is included in the main
> distribution, but is such a thing. Maybe, if the ifdefs don’t get too
> many, we could ship it with the main tarball, but require a specific
> Build.sh option to enable it. Like lksh.
I had in mind compile-time detection of EBCDIC (just depending on
what CFLAGS are set), but if you want the user to be aware that they
are leaving the boundaries of "standard mksh," then this would be a
way of doing so.
> The bikeshed question is, what to name it? mksh/EBCDIC? mksh/zOS?
> mksh/OS390? Or what? What should its KSH_VERSION look like⁴, and do
> you want the mksh/lksh distinction⁵ too?
I think that ASCII or EBCDIC needs to be indicated somehow, as both can
exist in this environment. (ASCII may be an unusual case, but as you've
seen, some folks care about it ;)
"OS390" has the advantage of matching "uname" output, and the Perl port
identifies itself as this, but "zOS" reflects the current name of the
OS. Good arguments both ways.
I can't say what KSH_VERSION should look like, but at least I can help
you make an informed judgment.
No reason I see not to support both mksh and lksh builds. (All I really
know about the latter is Debian's recommendation to use it instead of
mksh when replacing /bin/sh, so I was planning on doing as much.)
> ④ btw, is dot.mkshrc usable in your environment… once we get the bugs
> out, that is?
Wow, that's a _big_ startup file. I do see \033 hard-coded in a couple
of places... couldn't you take advantage of mksh's interpretation of \e
or \E there?
> ⑤ mostly, lksh uses POSIX arithmetic, whereas mksh use safe arithmetic
> (guaranteed 32-bit, with wraparound, and the signed arithmetics in
> shell are actually emulated using uint32_t in C code, plus it has
> guarantees for e.g. shift right, mostly like the 80386 works, and it
> can rotate left/right)
Hah, bit-twiddling in shell... that's a use case I wouldn't have thought
of :]
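Out of curiosity, the wraparound guarantee can be emulated in an ordinary
64-bit shell just to see the semantics (mksh does this natively; the
helper below is only an illustration):

```shell
# Emulate mksh's safe arithmetic: values are 32-bit with wraparound,
# and the result is reinterpreted as signed from the low 32 bits.
wrap32() {
    _v=$(( $1 & 4294967295 ))          # keep the low 32 bits
    if [ "$_v" -ge 2147483648 ]; then
        _v=$(( _v - 4294967296 ))      # reinterpret as signed
    fi
    echo "$_v"
}
wrap32 $(( 2147483647 + 1 ))   # in mksh, $((2147483647 + 1)) gives this directly
```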
--Daniel
--
Daniel Richard G. || [email protected]
My ASCII-art .sig got a bad case of Times New Roman.