OK, I solved the problem, but maybe someone here can come up with something a bit more efficient...

There is a file in the non-z/OS world that used to be pure ASCII (actually CP437/850), but that has now been converted to UTF-8 due to further internationalisation requirements. Said file was uploaded to z/OS, processed into a set of datasets containing various reports, and those reports were later downloaded to the non-z/OS world using the same transfer method that was used for the upload, which is one of two: IND$FILE or FTP.

Both FTP and IND$FILE uploads had (and still have) no problems with CP437/850/UTF-8 data, and although a ü might not have displayed as such on z/OS, it would have transferred back as the same ü. However, a ü in UTF-8 now consists of two bytes rather than one, and that means that, replacing spaces with '=' characters, the original

|=Süd====|
|=Nord===|

report lines now come out as

|=Süd===|
|=Nord===|

when opened in the non-z/OS world with a UTF-8 aware application.
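
In hex the effect looks like this (shown with the ASCII/UTF-8 values as they appear on the PC side; '|' = '7C'x, '=' = '3D'x, and the ü is '81'x in CP437 but 'C3BC'x in UTF-8):

CP437: 7C 3D 53 81 64 3D 3D 3D 3D 7C      10 bytes, 10 display positions
UTF-8: 7C 3D 53 C3 BC 64 3D 3D 3D 7C      10 bytes, but only 9 display positions

The z/OS side still builds and pads the field to the same number of bytes, so every UTF-8 follow-on byte costs one display position.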

Given that the PC file format, luckily in this case, allows comment-type lines, I solved the problem (the z/OS dataset is processed with PL/I) by adding an extra line to the input file, consisting of the required comment delimiter, followed by "ASCII ", followed by the 224 ASCII characters from '20'x to 'ff'x. The PL/I program uses this "special meta-data comment" to transform the input data, which IND$FILE/FTP has translated to EBCDIC, into a form where all UTF-8 initial bytes are translated to '1' and all UTF-8 follow-on bytes to '0', i.e.

dcl ascii char (224); /* containing the 224 characters from '20'x to 'ff'x, read in via the additional comment record in the original non-z/OS file */
dcl utf8  char (224) init (('11111111111111111111111111111111' ||
                            '11111111111111111111111111111111' ||
                            '11111111111111111111111111111111' ||
                            '00000000000000000000000000000000' ||
                            '00000000000000000000000000000000' ||
                            '00111111111111111111111111111111' ||
                            '11111111111111111111100000000000'));
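
Picking up the meta-data comment is nothing more exotic than something along these lines (the names are illustrative only, delimiter checking and error handling omitted):

dcl inrec char (255) var;             /* the comment record, as read from the input file */
dcl p     fixed bin (31);

p = index(inrec, 'ASCII ');
if p > 0 then
  ascii = substr(inrec, p + 6, 224);  /* the 224 bytes that follow "ASCII "              */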

To then get the number of UTF-8 displayable characters of, e.g., myvar, a char (47) variable, I use the following:

dcl a47(47) pic '9';
dcl more    char (20) var;

string(a47) = translate(myvar, utf8, ascii); /* '1' for every initial byte, '0' for every follow-on byte */
more        = copy(' ', 47 - sum(a47));      /* one extra blank for every follow-on byte                 */

where "more" is the number of extra blanks that needs to be added into the report column to ensure that the columns line-out again in the non-z/OS UTF-8 world. The (relative) beauty of this approach lies in the fact that the technique is completely code-page independent, and could even be used with the PL/I compiler on Windows.

The above works like a charm; however, both translate() and sum(), especially of pic '9' data, are not exactly the most efficient functions, so the question is: can anyone think of a more efficient way, other than the quick(?) and dirty solution of using a macro on the non-z/OS side, to set "more" to the required number of blanks? I'm open to a PL/I-callable assembler routine, but the process must, like the one above, be completely code-page independent!

Robert
--
Robert AH Prins
robert.ah.prins(a)gmail.com
