On Sun, 2015 Apr 26 14:47+0000, Thorsten Glaser wrote:
> Hi again,
> 
> a few questions back, and a few early comments on the general idea
> (I’ve not yet had the time to look at the patch itself):

Seems you did shortly after you wrote that :)

> Assume we have mksh running on your EBCDIC environment. Let me ask a
> few questions about this sort of environment, coupled with my guesses
> about it.
>
> - the scripts themselves are 'iconv'd to EBCDIC?

That is one way, but z/OS provides multiple layers of conversion that
make the process easier:

1. You can mark a particular NFS mount as "text", so that every file
   read through that mount is converted on the fly from ASCII to EBCDIC.
   (There are also "binary mounts" that do no conversion, and reading an
   ASCII file through them gives you gibberish.) Of course, reading any
   kind of binary data through a "text" mount will not go well.

   This is how I worked for the most part, with the mksh sources being
   read through such a "text"-mode NFS mount. Especially as the only
   text editor available in this z/OS installation appears to be vi :<

2. For files that are on the local filesystem, you can assign them an
   extended filesystem attribute marking them as either binary or text,
   and if text, you can specify the code page (be it ASCII or EBCDIC).
   So if the file is properly "tagged," and auto-conversion is enabled,
   then an EBCDIC application can read an ASCII file and have it work
   transparently.

   The tagging utility is called "chtag", and the auto-conversion
   parameter is "AUTOCVT", for Googling purposes.

3. I did notice that after the mksh build completed, the following files
   were tagged as EBCDIC text:

     t IBM-1047    T=on  Rebuild.sh
     t IBM-1047    T=on  conftest.c
     t IBM-1047    T=on  rlimits.gen
     t IBM-1047    T=on  sh_flags.gen
     t IBM-1047    T=on  signames.inc
     t IBM-1047    T=on  test.sh

   So there's also an element of auto-tagging, even though it
   shouldn't make a difference here conversion-wise as the files are
   already in EBCDIC.

To return to your question, while conversion with iconv(1) is available,
you can see it's far from the most convenient approach.

> - stuff like print/printf \x4F is expected to output '|' not 'O'

Yep! Just tried it in the shell:

    $ printf '\x4F\n'
    |

> - what about \u20AC? UTF-8? UTF-EBCDIC?

Many code pages have received euro-sign updates; e.g. EBCDIC 924 is
the euro-ified version of EBCDIC 1047. But that doesn't mean that
anyone actually _uses_ the updated versions. I haven't seen 924 pop
up anywhere.

UTF-8 is known to the system. There is an IBM code-page ID for it
(1208), iconv(1) knows about it, and you can tag files as UTF-8 text.
I don't think that necessarily indicates wider Unicode support,
however, as it would ultimately get converted to EBCDIC 1047 (or
whatever) anyway.

UTF-EBCDIC exists, but you wouldn't know it from the z/OS environment.
No code-page ID [as far as I've found], no mention in "iconv -l". When I
asked the mainframe guys at my company about it, they told me, "you
don't really want to deal with that."

I glanced at the Wikipedia article for UTF-EBCDIC, and can vouch for the
accuracy of this paragraph:

    This encoding form is rarely used, even on the EBCDIC-based
    mainframes for which it was designed. IBM EBCDIC-based mainframe
    operating systems, such as z/OS, usually use UTF-16 for complete
    Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM
    XML toolkit support UTF-16 on IBM mainframes.

Locale support in z/OS is like it was in Linux over a decade ago: If
you're a U.S. user, use the default code page; if you're a Russian user,
use a Russian code page, and so on... and all code pages are 8 bits.

> - keyboard input is in EBCDIC?

I worked by way of SSH'ing in to the z/OS OMVS Unix environment.
Everything in OMVS is EBCDIC, but of course my SSH client sends and
receives everything in ASCII. There is a network translation layer in
between, apart from the file-content conversion layers previously
mentioned, that makes it all work transparently.

A "real" mainframe connection, however, would be through TN3270, using
the x3270 program or the like. Then the conversion is happening on the
client side. But this is not relevant to mksh, because you don't get the
z/OS Unix environment through TN3270; you get the old-school full-screen
menu-driven interface that mainframe operators deal with.

(You can bring up OMVS via the TN3270 menu screens, but then you get a
horrible IRC-like line-based interface that sidesteps the normal Unix
shell. IMO, still irrelevant to mksh.)

> - is there anything that allows Unicode input?

From the keyboard? I've not seen anything suggesting this is possible.
Even IBM's z/OS Unicode support via UTF-16 is, as far as I can tell, for
use by applications and not by logged-in users.

My understanding of why things like locale/encoding support on the
console/terminal aren't up to snuff on z/OS is that this would only
benefit the crusty mainframe operators, who are few in number compared
to the user base of the application(s) running on the system. At the
same time, there is z/Linux (Linux on the mainframe), and
most organizations that want a modern mainframe Unix environment---and
no EBCDIC goofiness---just go with that. z/Linux is not an option for
me, however, so we have to confront all this weirdness head-on.

> Daniel Richard G. dixit:
>
> >conditionalized in the code. Primarily, EBCDIC has the normal
> >[0-9A-Za-z] characters beyond 0x80, so it is not possible to set the
> >high bit for signalling purposes---which mksh seems to do a lot of.
>
> Indeed. You probably refer to the variable substitution stuff (where
> '#'|0x80 is used for ${foo##bar}) and possibly the MAGIC stuff in
> extglobs @(foo|bar).
>
> That’s all legacy. I think it can go unconditionally.

I couldn't even suss out exactly what the 0x80 was being used for,
though I did see several variable-substitution tests fail. I'm glad
someone else knows that code like the back of their hand!
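
(In case it helps anyone following along, here is a minimal C
illustration of the conflict, not mksh code; the 0xC1 value is the
EBCDIC 1047 codepoint for 'A':)

    #include <stdio.h>

    /* In ASCII, every printable character is below 0x80, so OR-ing in
     * 0x80 can never collide with real text.  In EBCDIC 1047, letters
     * and digits already occupy 0x81..0xF9, so a "bit 7 set" test
     * misfires on perfectly ordinary characters.
     */
    int
    main(void)
    {
        unsigned char c = 'A';  /* 0x41 in ASCII, 0xC1 in EBCDIC 1047 */

        if (c & 0x80)
            printf("'A' = 0x%02X: bit 7 is already taken\n", c);
        else
            printf("'A' = 0x%02X: bit 7 is free for flag use\n", c);
        return 0;
    }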

> >* Added clauses for TARGET_OS == "OS/390"
>
> Is OS/390 always an EBCDIC environment?

Applications can be compiled in ASCII mode, and in fact I was
experimenting with this. But the compilers, tools et al. still need
EBCDIC input. (ASCII mode just means that all char/string literals are
given ASCII codepoints instead of EBCDIC, and C/system calls accept and
return ASCII. You end up with an ASCII application, basically, even
though the source and environment aren't.)
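
A one-line illustration of the difference (hypothetical snippet, not
from the patch):

    /* The same source line yields different object code: 'A' compiles
     * to 0x41 in ASCII mode and to 0xC1 (EBCDIC 1047) in the default
     * EBCDIC mode; "zzz" likewise becomes 0x7A or 0xA9 bytes.
     */
    char letter = 'A';
    const char *word = "zzz";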

> >* '\012\015' != '\n\r' on this platform, so use the latter
>
> Agreed. I think I can change most of the C code to use the char
> versions, i.e. '\n' and '@' instead of 0x0A or 0x40. I will have to
> see about Build.sh though.

It's certainly easier to let the compiler transcode characters. I
did leave in conditionals so that ASCII builds continue to use
numerical escapes, however, in light of those being more reliable on
some older systems.
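
Something along these lines, that is (the macro name is made up for
illustration; the actual conditionals in the patch may be arranged
differently):

    #ifdef MKSH_EBCDIC            /* hypothetical name */
    #define C_NEWLINE   '\n'      /* let the compiler supply the EBCDIC codepoint */
    #else
    #define C_NEWLINE   0x0A      /* keep the numerical escape on ASCII builds */
    #endif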

> Looking at the patch (editing this eMail later), it seems you have
> conditionalised those things nicely. That is good!

I, too, take portability seriously :)

> *BUT* some things in Build.sh – at least the $lfcr thing – are
> dependent on the *host* OS, not the *target* OS.

Ah, yes, that's true. xlc on z/OS can't cross-compile, but there is at
least one compiler that can cross-compile to z/OS (Dignus Systems/C).

> So, as far as I see, we will require two checks:
>
> • host OS (machine running Build.sh): ASCII-based or EBCDIC?

Perhaps the "printf '\x4F'" thing can be used to detect an EBCDIC build
environment...

> • target OS (machine running the mksh/lksh binary created):
>   z/OS ASCII, z/OS EBCDIC, or anything else?

There is also the matter of the EBCDIC variant. Among the EBCDIC code
pages that contain the full ASCII repertoire, characters are generally
assigned to the same codepoints. One exception, however, is between
EBCDICs 1047 and 037, which assign '[', ']', and '^'
differently---characters that are significant to the shell.

(EBCDIC 037 is likely to be the second-most-popular code page after
1047, and is in fact the x3270 default.)

I don't think it's feasible to have a single mksh binary support
multiple EBCDIC variants, however, so IMO this matter is best left to
the user's discretion in what CFLAGS they provide (-qconvlit option). As
long as the code specifies these characters literally instead of
numerically, everything should fall in line.
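
In code terms, the discipline is just this (illustrative fragment, not
mksh code; 0xAD is the 1047 codepoint for '[', if I have my tables
right):

    /* Spell shell-significant characters literally so the compiler
     * (via -qconvlit or the native code page) supplies the right
     * codepoint for whichever EBCDIC variant is in use.
     */
    #define IS_OBRACKET(c)          ((c) == '[')    /* tracks the build's code page */
    #define IS_OBRACKET_FRAGILE(c)  ((c) == 0xAD)   /* hard-codes 1047; wrong on 037 */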

> Remember mksh is cross-buildable. So we’ll need to come up with
> (compile-time) checks for all those.

That should be straightforward; the C_CTYPE_ASCII hack from gnulib is a
nice compile-time way of determining the character set that doesn't rely
on compiler-specific symbols. Even the EBCDIC variant could be detected
that way, though IMO it is better to remain as agnostic as possible on
this point.
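
Roughly like this, for the curious (a sketch based on my reading of
gnulib; MKSH_EBCDIC is just a placeholder name):

    /* Character constants in #if are, in practice, evaluated in the
     * execution character set, so this detects an ASCII(-superset)
     * target at compile time without compiler-specific symbols.
     * gnulib's real C_CTYPE_ASCII check tests the full printable set.
     */
    #if ('A' == 65) && ('a' == 97) && ('0' == 48) && ('!' == 33)
    /* ASCII-based execution character set */
    #else
    #define MKSH_EBCDIC 1   /* placeholder, not necessarily the real macro */
    #endif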

> >* NSIG is, amazingly, not #defined on this platform. Sure would be
> >  nice if the fancy logic that calculates NSIG could conditionally
> >  #define it, rather than a TARGET_OS conditional... :-)
>
> No, a TARGET_OS conditional is probably good here, as you cannot
> really guess NSIG – as you noticed, you seem to have 37 signals, the
> highest number of which is 39.
>
> The best way to determine NSIG is to look at libc sources, followed by
> looking at libc binaries (e.g. determine the size of sys_siglist).

No libc sources here, I'm afraid (everything is "OCO," as the mainframe
folks say---"object code only"). I'm not sure there's a sys_siglist to
be found, even if I could tell where the runtime libraries live.

Couldn't the Build.sh code suss out the signals better if it knew about
the ones that are unique to z/OS? IBM might add even more signals down
the line, after all...

> >* On this platform, xlc -qflag=... takes only one suboption, not two
>
> Hm. Is this a platform thing, a compiler version thing, etc?

It seems that xlc on z/OS, though sharing a name with IBM's AIX
compiler, comes from a different codebase. It _feels_ like an ancient
version of xlc, in not recognizing many modern options, yet I am told
that the version installed is current as of 2014.

(The version reported, V2.1, also happens to be the running version of
z/OS. I don't think that's a coincidence.)

> >* Some special flags are needed for xlc on z/OS that are not needed
> >  on AIX, like to make missing #include files an error instead of a
> >  warning (!). Conversely, most of those AIX xlc flags are not
> >  recognized
>
> Can we get this sorted out so it continues working on AIX? I do not
> have access to an AIX machine any longer, unfortunately.
>
> (Later: conditionalised, looks good.)

I tried first to have Build.sh just test to see if the options were
recognized, but (1) xlc only gives warnings if it doesn't recognize the
option/suboptions, and (2) neither -qhaltonmsg nor -qseverity will
accept making those warnings into errors, even though they gladly do so
for other warnings!

I do have access to modern AIX systems with xlc. Ohhh, I'm seeing some
nastiness there:

    "rlimits.gen", line 20.20: 1506-191 (E) The character # is not a
    valid C source character.
    "rlimits.gen", line 20.21: 1506-191 (E) The character # is not a
    valid C source character.

I can troubleshoot this once the code has stabilized...

> >* Added a note that EBCDIC has \047 as the escape character rather
> >  than \033
>
> Do EBCDIC systems use ANSI escapes (like ESC [ 0 m) still?

I can't say whether that's the case in the TN3270 environment, but if
I'm connected via SSH, then my terminal is just a standard TERM=xterm
terminal window that responds accordingly. Of course, this is going
through the automatic EBCDIC<->ASCII conversion; not just printable
characters but also control characters get translated.

> >+++ check.pl
> >
> >* I was getting a parse error with an expected-exit value of
> >  "e != 0", and adding \d to this regex fixed things... this wasn't
> >  breaking for other folks?
>
> No. I don’t pretend to know enough Perl to even understand that.
>
> But I think I see the problem, looking at the two regexps. I think
> that “+-=” is expanded as range, which includes digits in ASCII. I’ll
> have to go through CVS and .tgz history and see whether this was
> intentional or an accidental fuckup.

Ohhh, good catch! Yes, that hyphen really ought to be the first
character inside the brackets. I can't imagine that this was
intentional---and if it were, it would be awful form, transparency-wise.

> >+++ check.t
> >
> >* The "cd-pe" test fails on this system (perhaps it should be
> >  disabled?) and the directories were not getting cleaned up properly
>
> That fails on many systems. Sure, we can disable it.
>
> What is $^O in Perl on your platform?

    $ perl -e 'print $^O' ; echo
    os390

It's possible that this value may vary, depending on who did the Perl
port. But this seems related to the value printed by "uname"
(lowercased, minus the slash), and so should hopefully be stable.

> >* If compiling in ASCII mode, #define _ENHANCED_ASCII_EXT so that as
> >  many C/system calls are switched to ASCII as possible (this is
> >  something I was experimenting with, but it's not how most people
> >  would be building/using mksh on this system)
>
> So it’s possible to use ASCII on the system, but atypical?

Right. In an EBCDIC environment, it's not terribly useful:

    $ ./mksh -c 'echo zzz'; echo
    :::

I'm looking into setting up an environment that is all-ASCII, starting
from /bin/sh (hence why I'm here), but have yet to figure out how to
switch off the EBCDIC<->ASCII network conversion layer. (There are so
many layers of conversion on z/OS that at times it's hard to keep track
of what is what!)

> >* Because EBCDIC characters like 'A' will have a negative value if
> >  signed chars are being used, #define the ORD() macro so we can
> >  always get an integer value in [0, 255]
>
> Huh, Pascal anyone? :)

I figured, Perl and Python have picked it up, why not here too? ;)
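
Something as simple as this should do (a sketch; the macro in my patch
may be spelled a bit differently):

    /* Casting through unsigned char keeps e.g. EBCDIC 'A' (0xC1) from
     * turning into a negative int when plain char is signed; the result
     * is always in [0, 255] and safe to use as a table index.
     */
    #define ORD(c)  ((int)(unsigned char)(c))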

> >+++ edit.c (back to patch order)
>
> Here’s where we start going into Unicode land. This file is the one
> that assumes UTF-8 the most.

But that need not be active, right? I saw UTFMODE in run-time
conditionals all over the place.

> >* I don't understand exactly what is_mfs() is used for, but I'm
> >  pretty sure we can't do the & 0x80 with EBCDIC (note that e.g.
> > 'A' == 0xC1)
>
> Motion separator. It’s simply assumed that, when you e.g. jump
> forwards word-wise, anything with bit7 set is Unicode and to be jumped
> over, as we don’t have iswprint() et al.

Ah, okay, I see. Yes, that wouldn't really be applicable to
EBCDIC, then.

> >* Don't know much about XFUNC_VALUE(), but that & 0x7F looks un-
> >  kosher for EBCDIC
>
> No, that’s actually fine, that’s an enum (with < 128 values), and the
> high bit is used here to swallow a trailing tilde, like in ANSI Del
> (^[[3~).

I'm very happy not to have to figure _everything_ out =)

> >I will be happy to provide further testing and answer any questions
> >as needed.
>
> OK. This is just a start.

Agreed.

> I’ll add the… hopefully not discouraging… comments now. As I’ve said,
> I really like the enthusiasm, and absolutely want you to continue with
> this. There is just a very big thing:
>
> One of mksh’s biggest strengths is that it’s consistent across *all*
> platforms. An analogy, to help understand:
>
> I don’t know how much you know about Microsoft Windows, but they use
> CR+LF (\r\n) as line separators usually for (old) native code. There
> are Unix-like environments for it (Cygwin, and the much better
> Interix/SFU/SUA, and the less-well-working UWIN and PW32), and you can
> compile mksh for those; mksh will then behave as on Unix, i.e. require
> LF-only (\n) line endings.
>
> Someone has started to port mksh to native WinAPI, and that port is
> not 100% compatible to mksh, just “similar”, and uses it as base code.
> That implementation then can use CR+LF.
>
> By definition, mksh does all its I/O in binary mode (not “text” mode,
> so no CR+LF or (old Macintosh) CR-only line endings), and in the UTF-8
> encoding of 16-bit Unicode as charset.

There are the differences/similarities in behavior (LF vs. CR+LF), and
then there are the differences in code (Win32 API calls instead of POSIX
calls). I'm presuming your point is that the latter is more of a
practical concern... I mean, if you could support MacOS 9 with its
CR-only line endings just by tweaking a few lines of code, surely that
would be desirable, even if it makes an exception to the "consistent
across all platforms" property? In reality, it would require a lot more
invasive changes (not least due to the lack of POSIX), and I'd
anticipate _that_ to be the main objection to integrating such support.

> I’ve got a suggestion for you here, though. Most of it depends on some
> answers to the questions I had above, this is just an initial rough
> draft, to be discussed.
>
> I’ll merge most of the EBCDIC- and z/OS-related changes. A future mksh
> release will compile for z/OS in ASCII mode out of the box, and pass
> all of its tests there, if at all possible. Even if this is not how a
> typical z/OS user would use mksh, this should be easy.

I agree that this should be possible, and I hope to get there yet.

> You’ll be the maintainer of something we call mksh/zOS, or something
> like that (or mksh/EBCDIC), which has a separate $KSH_VERSION string.
> I was thinking either “@(#)EBCDIC MKSH R…” or “@(#) Z/OS MKSH R…”,
> with LKSH instead of MKSH for builds with -L, or just one string, and
> you decide on whether you want “POSIX arithmetics” mode always enabled
> or not (the main compile-time dif‐ ference of lksh) – but, why remove
> the flexibility. I’d also ask Michael Langguth to make mksh/Win32 fit
> this scheme (i.e. use something like “@(#) WIN32 MKSH”, depending on
> what we agree on; currently, mksh/Win32 is based on mksh R39, so it
> didn’t have the lksh yet).

Ooof... a long-term commitment like this is going to be hard for me. For
my part, porting mksh is just one piece of a larger puzzle I'm building.
I'd like to leave things in good shape here, and then move on to other
areas of investigation.

My hope is that EBCDIC support can be integrated by generalizing code
that assumes ASCII, and adding a minimum of code that is specific to
EBCDIC (which is why my patch all but apologizes for the tables needed
for Ctrl-key mapping). The one change in my patch that best exemplifies
the approach I have in mind is

    -               if ((s[0] | 0x20) == 'x') {
    +               if (s[0] == 'x' || s[0] == 'X') {

That not only makes the code EBCDIC-compatible, it arguably makes it
clearer. Not a separate section of code prone to bit rot, but simply a
particular discipline applied to the common/platform-independent code.
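
(The reason the old test breaks, for the curious: in ASCII the case bit
is 0x20, while in EBCDIC upper and lower case differ by 0x40. A small
illustrative program, not mksh code:)

    #include <stdio.h>

    int
    main(void)
    {
        char c = 'X';   /* 0x58 in ASCII; 0xE7 in EBCDIC 1047 */

        /* ASCII: 'X' | 0x20 == 0x78 == 'x', so both lines say "match".
         * EBCDIC: 'x' is 0xA7 and bit 0x20 is already set in 0xE7, so
         * the old test prints "no match" while the new one still works.
         */
        printf("old test: %s\n", ((c | 0x20) == 'x') ? "match" : "no match");
        printf("new test: %s\n", (c == 'x' || c == 'X') ? "match" : "no match");
        return 0;
    }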

> Details of this can be hashed out later. We can have this in a range
> of varieties:
> • you’ll ship mksh-ebcdic-*.tgz files from a separate repository
> • I’ll ship them, from a separate repository
> • we develop this in the same repo, in a separate branch
> • or it could be a bunch of #ifdefs

I would hope for the latter, keeping the set of #ifdefs as small and
manageable as possible.

> To be honest, I’d prefer looking at the amount of ifdefs before
> agreeing to the latter though. mksh/Win32 is also separately
> developed; while the code is close to “main” mksh, there *is* a patch,
> part of which I’d prefer to not ship in mksh-R*.tgz itself. (But to
> keep the delta small is a good aim.) This also allows for different
> development tempo and release schedules.

Fair enough; I think that the changes needed to support EBCDIC would
weigh in a lot lighter than those needed to support Windows natively.
The same POSIX API is used, after all---the worst of it is the Ctrl-key
mapping tables.

> As I said, I’ll gladly add “not-hurting” portability to EBCDIC to the
> main code, e.g. remove the use of |0x80 as flag magic. (I’ll come up
> with something, probably after the R51 release though. I’ve got
> ideas.) But mksh uses UTF-8, and my plans for it will only make this
> worse, e.g. I’m planning to make some code use 16-bit Unicode
> internally (though part of *this* may make EBCDIC easier again). I
> cannot commit to keep supporting EBCDIC systems, due to lack of
> resources (my own time, skills (I have no experience with nōn-ASCII-
> based systems) and lack of such machines).

Will mksh continue to support ISO 8859-1 (Latin-1) environments, even if
with reduced functionality? As long as UTF-8 mode isn't outright
_required_, and bit 7 is left alone, I don't see that EBCDIC systems
have much to worry about.

> Do you think you can help me out there and invest a little time (a few
> hours per month I guess) and maintain a port of mksh to EBCDIC-based
> systems (or even just z/OS) for a long-ish time?

I'll be happy to test, troubleshoot and submit patches as needed. Once
EBCDIC is supported, as long as minimal care is taken over time to
avoid breaking it, fixing any regressions down the line should be
straightforward.

> Do you think you can, or want to, develop this separately, merging
> changes back and forth? (I can, of course, do most of the changes-
> merging work, but you will have to be there to deal with EBCDIC-
> specific façettes.)

Well... you really think the changes are extensive enough to
warrant a fork?

> This is all volunteer work, so I’ll understand if you cannot or don’t
> want to commit to something long-lasting like this either. But from
> the two messages you already sent, I presume you have got some kind of
> interest ☺

Oh, the interest is there! I just didn't think of this work as being as
brave-new-world-ish (in terms of code) as a Win32 or MacOS 9 native port
would be...

> Legalities: I just request that anything I merge is licenced under The
> MirOS Licence¹, I don’t require anything like copyright assignment or
> that, and I don’t even impose any licencing terms on the derivates
> (like mksh/Win32), but I prefer they use a BSD-style licencing scheme
> for the whole. (Michael said he’s planning to publish the whole Win32-
> portability library under BSD-ish terms as well.)

I am quite agreeable to BSD-ish terms, and in any event I hereby
explicitly agree to release all of my work relating to mksh under the
same license terms as mksh itself.

> ① I have once, on an OSI mailing list, stated requirements for a
>   successor to The MirOS Licence. I don’t believe it will come to
>   that a successor is written, but should there be one, I’d be happy
>   to be able to switch the licence to it. Those mostly are: lawyer-
>   written, also applies to neighbouring rights such as database law
>   (in some EU countries). I wish the licence to be tailored to EU
>   (mostly .de, as I live there) law, protect all involved (authors,
>   contributors, licensors, licensees), but usable internationally as
>   far as that’s possible. I don’t really wish to touch the topic of
>   patent licences, but let it be understood that an implicit patent
>   grant is included.

That's a fairly tall order. I've heard plenty about the FSFE from time
to time, but nothing about a BSD-centric umbrella organization based in
the EU that would be an appropriate body to tackle developing such a
license...

> Urgh. I’m rambling again. Sorry about that.

Well, you've read this far, so at least you're game as well ;)


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.
