Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-06 Thread Don Poitras
In article  
you wrote:
> On Wed, Sep 6, 2017 at 8:34 AM, Don Poitras  wrote:

> > For collating, I think most people use the ICU libraries. I know the C++
> > version has been used on z/OS by lots of folks and some searching found
> > a COBOL page. I have no idea if z/OS COBOL 4.2 can use it.
> >
> > http://userguide.icu-project.org/usefrom/cobol
> >
> Thanks! I'll read that over. We don't use anything other than CP-037 and
> IBM-1047 EBCDIC (mainly the former) for our character data. But I like to
> keep up with what is happening in the real world.
> Maranatha! <><
> John McKown

You're welcome. I noticed that that page says it has sample programs,
but they got truncated somehow. I found the full samples on github:

https://github.com/morecobol/icu4c-cobol-samples/tree/master/src

-- 
Don Poitras - SAS Development  -  SAS Institute Inc. - SAS Campus Drive
sas...@sas.com   (919) 531-5637   Cary, NC 27513

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-06 Thread John McKown
On Wed, Sep 6, 2017 at 8:34 AM, Don Poitras  wrote:

> For collating, I think most people use the ICU libraries. I know the C++
> version has been used on z/OS by lots of folks and some searching found
> a COBOL page. I have no idea if z/OS COBOL 4.2 can use it.
>
> http://userguide.icu-project.org/usefrom/cobol
>

Thanks! I'll read that over. We don't use anything other than CP-037 and
IBM-1047 EBCDIC (mainly the former) for our character data. But I like to
keep up with what is happening in the real world.




-- 
UNIX was not designed to stop you from doing stupid things, because that
would also stop you from doing clever things. -- Doug Gwyn

Maranatha! <><
John McKown

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-06 Thread Don Poitras
For collating, I think most people use the ICU libraries. I know the C++
version has been used on z/OS by lots of folks and some searching found
a COBOL page. I have no idea if z/OS COBOL 4.2 can use it.

http://userguide.icu-project.org/usefrom/cobol


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-06 Thread John McKown
On Wed, Sep 6, 2017 at 12:42 AM, Peter Hunkeler  wrote:

> >>If for some odd reason you absolutely insist on an EBCDIC-ish approach
> then
> >>you can do what the Japanese have done for decades: Shift Out (SO), Shift
> >>In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd
> probably
> >>use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
> >>1140, then SO/SI from there to pick up the exceptional characters.
> >>
> >The worst of both worlds.
>
> It's repeating history. The origin of all that code page mess was
> companies (not countries at that time) starting to build their own custom
> code page for any character they needed that was not in the (single) EBCDIC
> code page. Later, some standardization was done and country code pages
> evolved.
>
> While it was justifiable at that time, it is not today. Do not start this
> mess again by doing your own code page thing in your programs. Go Unicode,
> UTF-8 or UTF-16, whatever suits best.
>

I agree with the sentiment. On Linux/Intel, I set my locale to en_US.utf8.
The "Go" and "Python3" language definitions _require_ their source to be in
UTF-8. But I wonder how well UTF-8 is really supported by z/OS
_applications_. I'm still stuck on z/OS 1.13 and COBOL 4.2, so I will ask.
Can I directly (and correctly) process UTF-8 coded characters in a COBOL 6
program? Even the multibyte characters? What about DFSORT? From the manual
at
https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.icea100/ice2ca_DFSORT_data_formats.htm
it appears to support UTF8, UTF16, and UTF32. But I'd love to see an
example of how that works. In particular, how do you say "this file is in
UTF8. Sort on the 3rd through the 10th characters."? The problem, to me, is
how do I say "the 3rd through the 10th characters"? If the data is all in
UTF8, then the 3rd character need not start in the 3rd byte. And the number
of bytes in those eight characters is not necessarily 8, but could be anywhere
from 8 to 32 bytes, depending on the characters.
Also, according to the same manual (different page), a "character string"
is always in EBCDIC. So I guess if you want to include based on a UTF8
string, you need to use hex encoding.
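
As a rough illustration of the byte-position problem (plain Python, not DFSORT
control statements, and the sample string is invented), character positions and
byte positions drift apart as soon as a multi-byte character appears:

text = "Süd und Nord"          # 12 characters
raw  = text.encode("utf-8")    # 13 bytes, because 'ü' encodes as two bytes

print(len(text), len(raw))     # 12 13
print(text[2:10])              # 'd und No'    -- "characters 3 through 10"
print(raw[2:10])               # b'\xbcd und N' -- byte columns 3 through 10
                               #    start with the dangling second byte of 'ü'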



>
>
> --
> Peter Hunkeler
>
> --
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
>



-- 
UNIX was not designed to stop you from doing stupid things, because that
would also stop you from doing clever things. -- Doug Gwyn

Maranatha! <><
John McKown

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


AW: Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Peter Hunkeler
>>If for some odd reason you absolutely insist on an EBCDIC-ish approach then
>>you can do what the Japanese have done for decades: Shift Out (SO), Shift
>>In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd probably
>>use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
>>1140, then SO/SI from there to pick up the exceptional characters.
>>
>The worst of both worlds.



It's repeating history. The origin of all that code page mess was companies
(not countries at that time) starting to build their own custom code page for
any character they needed that was not in the (single) EBCDIC code page. Later,
some standardization was done and country code pages evolved.


While it was justifiable at that time, it is not today. Do not start this mess
again by doing your own code page thing in your programs. Go Unicode, UTF-8 or 
UTF-16, whatever suits best.


--
Peter Hunkeler

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Andy Wood
On Tue, 5 Sep 2017 22:30:59 +0800, Timothy Sipples  wrote:

...
>you can do what the Japanese have done for decades: Shift Out (SO), Shift
>In (SI).

ZCZC
DECADESQUERY


--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Robert Prins

On 2017-09-05 15:41, Walt Farrell wrote:

On Tue, 5 Sep 2017 10:19:45 -0500, Paul Gilmartin 
wrote:


What language(s) cleanly handle vertical alignment of formatted text output
when the text contains UTF-16 supplemental/surrogate (not in the BMP)
characters? Here's an example of /bin/printf's failure for similar input
with UTF-8 on MacOS:

The script:
printf "%-22s+++\n" "Hello World."
printf "%-22s+++\n" "Привет мир."
printf "%-22s+++\n" "Bonjour le monde."

writes:
Hello World.  +++
Привет мир.  +++
Bonjour le monde. +++

I wish the "+++" would line up (at least in a monospaced font). What sort
of PICTURE would work for such, not restricting to BMP?


It would take more than a simple script like that, but with programming it
can be done. I have a Python program that does it, for example. The key is
understanding that some characters don't take up any space when printed
(combining characters, for example), and therefore don't contribute to the
length of the output string. When those characters are present you need to
pad the end with blanks if you want a fixed width output string.


And that is exactly what I'm doing with my translate/sum method. I know that any
character that starts with the orange bytes in  is a non-printing one (yes,
there are a few exceptions that I do not cater for, assuming the non-z/OS file
contains correct UTF-8), and the translate just sets them to zero.


As I wrote, it works like a charm, but may not be the most efficient way of 
doing things, although, given the (still) limited amount of UTF-8 text that has 
to undergo this kind of processing, it's probably way faster than converting the 
entire file into a multi-byte format, and using PL/I WCHAR's and the ULENGTH() 
builtin, which must, in its implementation, do something pretty similar anyway.


Robert
--
Robert AH Prins
robert(a)prino(d)org

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread David W Noon
On Tue, 5 Sep 2017 10:19:45 -0500, Paul Gilmartin
(000433f07816-dmarc-requ...@listserv.ua.edu) wrote about "Re: UTF-8
woes on z/OS, a solution - comments invited" (in
<2075516733653603.wa.paulgboulderaim@listserv.ua.edu>):

[snip]
> What language(s) cleanly handle vertical alignment of formatted text output 
> when
> the text contains UTF-16 supplemental/surrogate (not in the BMP) characters?

Python and Java, at least.

> Here's an example of /bin/printf's failure for similar input with UTF-8 on 
> MacOS:
> 
> The script:
> printf "%-22s+++\n" "Hello World."
> printf "%-22s+++\n" "Привет мир."
> printf "%-22s+++\n" "Bonjour le monde."
> 
> writes:
> Hello World.  +++
> Привет мир.  +++
> Bonjour le monde. +++
> 
> I wish the "+++" would line up (at least in a monospaced font).

This is a bug in your printf UNIX command. It is counting bytes to
determine print position, rather than counting glyphs. It probably isn't
Unicode-aware.
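
A quick way to see the mismatch (a Python sketch rather than the shell, using
the same three strings):

for s in ("Hello World.", "Привет мир.", "Bonjour le monde."):
    print(len(s), len(s.encode("utf-8")), s)
# 12 12 Hello World.
# 11 20 Привет мир.
# 17 17 Bonjour le monde.
# A printf that pads "%-22s" by byte count gives the Cyrillic line nine fewer
# display columns than intended.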
-- 
Regards,

Dave  [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
david.w.n...@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

 

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread David W Noon
On Tue, 5 Sep 2017 16:33:43 +, Pew, Curtis G
(curtis@austin.utexas.edu) wrote about "Re: UTF-8 woes on z/OS, a
solution - comments invited" (in
<cdcfa846-0e36-494a-96cc-bc90f69e9...@austin.utexas.edu>):

> In Python 3, at least, the built-in substitution facility can handle it as-is:

Python 3 uses UTF-32 for all its default character strings. This
relieves the problem of counting bytes or counting glyphs.
-- 
Regards,

Dave  [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
david.w.n...@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

 

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Pew, Curtis G
On Sep 5, 2017, at 10:42 AM, Walt Farrell  wrote:
> 
> Python has Unicode functions that let you examine the characteristics of the 
> characters within a string so you can figure out the proper length when 
> printed, but I'm not aware of anything built-in like a print function that 
> does that automatically. It would be handy.

In Python 3, at least, the built-in substitution facility can handle it as-is:

Python 3.5.4 (default, Aug 12 2017, 14:31:52) 
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def fmtprt(hw):
... print("%-22s+++\n" % hw)
... 
>>> fmtprt("Hello, world!")
Hello, world! +++

>>> fmtprt("Привет мир.")
Привет мир.   +++

>>> fmtprt("Bonjour le monde.")
Bonjour le monde. +++

>>> 


-- 
Pew, Curtis G
curtis@austin.utexas.edu
ITS Systems/Core/Administrative Services


--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Tony Harminc
On 5 September 2017 at 10:30, Timothy Sipples  wrote:

>
> If for some odd reason you absolutely insist on an EBCDIC-ish approach then
> you can do what the Japanese have done for decades: Shift Out (SO), Shift
> In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd probably
> use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
> 1140, then SO/SI from there to pick up the exceptional characters.
>

Another EBCDIC-ish approach would be UTF-EBCDIC. This is fully supported by
z/OS Unicode conversion services; perhaps PL/I (and other things) should
make it Just Work under the covers.

Tony H.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Walt Farrell
On Tue, 5 Sep 2017 10:19:45 -0500, Paul Gilmartin  wrote:

>What language(s) cleanly handle vertical alignment of formatted text output 
>when
>the text contains UTF-16 supplemental/surrogate (not in the BMP) characters?
>Here's an example of /bin/printf's failure for similar input with UTF-8 on 
>MacOS:
>
>The script:
>printf "%-22s+++\n" "Hello World."
>printf "%-22s+++\n" "Привет мир."
>printf "%-22s+++\n" "Bonjour le monde."
>
>writes:
>Hello World.  +++
>Привет мир.  +++
>Bonjour le monde. +++
>
>I wish the "+++" would line up (at least in a monospaced font).
>What sort of PICTURE would work for such, not restricting to BMP?

It would take more than a simple script like that, but with programming it can 
be done. I have a Python program that does it, for example. The key is 
understanding that some characters don't take up any space when printed 
(combining characters, for example), and therefore don't contribute to the 
length of the output string. When those characters are present you need to pad 
the end with blanks if you want a fixed width output string.

Python has Unicode functions that let you examine the characteristics of the 
characters within a string so you can figure out the proper length when 
printed, but I'm not aware of anything built-in like a print function that does 
that automatically. It would be handy.
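
For what it's worth, a minimal sketch of that idea (not Walt's program; it
only accounts for combining marks and ignores, for example, East Asian wide
characters):

import unicodedata

def printed_width(s):
    # Combining marks occupy no column of their own.
    return sum(0 if unicodedata.combining(ch) else 1 for ch in s)

def pad(s, width):
    # Pad on printed width rather than on len(s).
    return s + " " * max(0, width - printed_width(s))

for s in ("Hello World.", "Привет мир.", "cafe\u0301"):  # 'e' + combining acute
    print(pad(s, 22) + "+++")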

Presumably one could do that in other languages, too. And presumably one could 
implement a print function that did that automatically. Perhaps someone has, or 
perhaps some language can do it automatically.

-- 
Walt

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Paul Gilmartin
On Tue, 5 Sep 2017 22:30:59 +0800, Timothy Sipples wrote:
>
>FYI, if DB2 for z/OS is in the loop then DB2 will convert UTF-8 to UTF-16
>for your PL/I application(s). Just store the UTF-8 data in DB2, use the
>WIDECHAR datatype, and it all happens automagically, effortlessly, with no
>UTF-8 to UTF-16 programming required. See here for more information:
>
>https://www.ibm.com/support/knowledgecenter/en/SSEPEK_12.0.0/char/src/tpc/db2z_processunidatapli.html
>
What language(s) cleanly handle vertical alignment of formatted text output when
the text contains UTF-16 supplemental/surrogate (not in the BMP) characters?
Here's an example of /bin/printf's failure for similar input with UTF-8 on 
MacOS:

The script:
printf "%-22s+++\n" "Hello World."
printf "%-22s+++\n" "Привет мир."
printf "%-22s+++\n" "Bonjour le monde."

writes:
Hello World.  +++
Привет мир.  +++
Bonjour le monde. +++

I wish the "+++" would line up (at least in a monospaced font).
What sort of PICTURE would work for such, not restricting to BMP?

>If for some odd reason you absolutely insist on an EBCDIC-ish approach then
>you can do what the Japanese have done for decades: Shift Out (SO), Shift
>In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd probably
>use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
>1140, then SO/SI from there to pick up the exceptional characters.
>
The worst of both worlds.

-- gil

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Pew, Curtis G
On Sep 5, 2017, at 8:54 AM, Paul Gilmartin 
<000433f07816-dmarc-requ...@listserv.ua.edu> wrote:
> 
> Are you confusing UTF-16 and UCS-2?
>https://en.wikipedia.org/wiki/UTF-16
> 
>UTF-16 (16-bit Unicode Transformation Format) is a character encoding
>capable of encoding all 1,112,064 valid code points of Unicode. The
>encoding is variable-length, as code points are encoded with one or two
>16-bit code units. (also see Comparison of Unicode encodings for a
>comparison of UTF-8, -16 & -32)
> 
>UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2
>(for 2-byte Universal Character Set) once it became clear that 16 bits were
>not sufficient for Unicode's user community.[1]

I was trying to say what the second paragraph you quoted says, without 
explicitly mentioning UCS-2. At least part of the answer to “Why is there 
UTF-16?” is “Because once there was UCS-2.”

-- 
Pew, Curtis G
curtis@austin.utexas.edu
ITS Systems/Core/Administrative Services


--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Timothy Sipples
Paul Gilmartin wrote:
>Why is there UTF-16?
>[]
>o It lacks the compactness of UTF-8 in the case of Latin text.

Japanese Kanji, Traditional Chinese, Simplified Chinese, and emoji (!), as
examples, are not Latin text. More than 1.5 billion people is a lot of
people, and that's not counting all the billions of emoji users. :-)

And who cares about this compactness, really? Bytes are no longer *that*
precious, especially when they're compressed anyway.

>(What does Java use internally?)

UTF-16, as it happens.

FYI, if DB2 for z/OS is in the loop then DB2 will convert UTF-8 to UTF-16
for your PL/I application(s). Just store the UTF-8 data in DB2, use the
WIDECHAR datatype, and it all happens automagically, effortlessly, with no
UTF-8 to UTF-16 programming required. See here for more information:

https://www.ibm.com/support/knowledgecenter/en/SSEPEK_12.0.0/char/src/tpc/db2z_processunidatapli.html

If for some odd reason you absolutely insist on an EBCDIC-ish approach then
you can do what the Japanese have done for decades: Shift Out (SO), Shift
In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd probably
use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
1140, then SO/SI from there to pick up the exceptional characters.
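
For anyone who has not seen mixed data, the framing looks roughly like this
(a Python sketch only, with its cp037 codec standing in for a real conversion;
the two double-byte code points are placeholders, not real CCSID 300 values):

SO, SI = b"\x0e", b"\x0f"            # EBCDIC Shift Out / Shift In controls
sbcs  = "Price: ".encode("cp037")    # ordinary single-byte EBCDIC text
dbcs  = b"\x44\x5a\x44\x5b"          # two hypothetical double-byte characters
mixed = sbcs + SO + dbcs + SI + "!".encode("cp037")
print(mixed.hex())                   # SBCS text, X'0E', DBCS pairs, X'0F', SBCS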


Timothy Sipples
IT Architect Executive, Industry Solutions, IBM z Systems, AP/GCG/MEA
E-Mail: sipp...@sg.ibm.com

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Paul Gilmartin
On 2017-09-05, at 06:36, Pew, Curtis G wrote:
> 
> Unicode was originally supposed to be a fixed-width, 16-bit encoding. 
> Fixed-width was actually a design criterion for the original developers. It
> was only after it became clear that there was no possible way to fit all the 
> needed characters into 16 bits that the “astral planes”[1] were (reluctantly) 
> added to Unicode and the various UTF encodings defined. In this light, UTF-16 
> is the closest thing to the original version of Unicode. Also, if your text 
> includes few or no Latin characters UTF-16 may be just as compact, or even 
> more compact, than UTF-8, and can probably be processed more easily.
>  
Are you confusing UTF-16 and UCS-2?
https://en.wikipedia.org/wiki/UTF-16

UTF-16 (16-bit Unicode Transformation Format) is a character encoding
capable of encoding all 1,112,064 valid code points of Unicode. The
encoding is variable-length, as code points are encoded with one or two
16-bit code units. (also see Comparison of Unicode encodings for a
comparison of UTF-8, -16 & -32)

UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2
(for 2-byte Universal Character Set) once it became clear that 16 bits were
not sufficient for Unicode's user community.[1]

> Since Java was developed when Unicode was still supposed to be a 16-bit 
> encoding the early versions at least used what we would now call UTF-16. As I 
> recall, there was a significant period of time after Unicode abandoned a 
> fixed-width 16-bit representation before Java implementations really 
> supported characters from the “astral planes”.
> 
> 
> [1] Unicode is still organized into 64K ranges called “planes”. The original 
> 0–FFFF range is called the “Basic Multilingual Plane” (BMP) and “astral
> planes” is a convenient nickname for the other ranges.

-- gil

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Pew, Curtis G
On Sep 4, 2017, at 9:02 PM, Paul Gilmartin 
<000433f07816-dmarc-requ...@listserv.ua.edu> wrote:
> 
> Why is there UTF-16?
> 
> o It's a variable-length encoding, involving the same complexities as UTF-8.
> 
> o It lacks the compactness of UTF-8 in the case of Latin text.
> 
> Is it because it's (sort of) an extension of UCS-2?
> 
> (What does Java use internally?)

Unicode was originally supposed to be a fixed-width, 16-bit encoding. 
Fixed-width was actually a design criterion for the original developers. It was
only after it became clear that there was no possible way to fit all the needed 
characters into 16 bits that the “astral planes”[1] were (reluctantly) added to 
Unicode and the various UTF encodings defined. In this light, UTF-16 is the 
closest thing to the original version of Unicode. Also, if your text includes 
few or no Latin characters UTF-16 may be just as compact, or even more compact, 
than UTF-8, and can probably be processed more easily.
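
The size trade-off is easy to check (Python, with utf-16-be so no BOM is
counted):

for s in ("Bonjour le monde.", "日本語のテキスト", "😀"):
    print(len(s.encode("utf-8")), len(s.encode("utf-16-be")), s)
# 17 34 Bonjour le monde.   -- Latin text: UTF-8 is smaller
# 24 16 日本語のテキスト     -- CJK text: UTF-16 is smaller
#  4  4 😀                  -- outside the BMP: four bytes either way
#                              (a surrogate pair in UTF-16)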

Since Java was developed when Unicode was still supposed to be a 16-bit 
encoding the early versions at least used what we would now call UTF-16. As I 
recall, there was a significant period of time after Unicode abandoned a 
fixed-width 16-bit representation before Java implementations really supported 
characters from the “astral planes”.


[1] Unicode is still organized into 64K ranges called “planes”. The original 
0–FFFF range is called the “Basic Multilingual Plane” (BMP) and “astral
planes” is a convenient nickname for the other ranges.

-- 
Pew, Curtis G
curtis@austin.utexas.edu
ITS Systems/Core/Administrative Services


--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Scott Chapman
On Mon, 4 Sep 2017 21:02:29 -0500, Paul Gilmartin  wrote:

>(What does Java use internally?)
>
>-- gil

Currently Java does use UTF-16, but Java 9 will get a little smarter about
that, storing strings in 1 byte/character ISO-8859-1/Latin-1 where it can.
http://openjdk.java.net/jeps/254

The G1 garbage collector (which I believe will be the new default) will also 
get string deduplication:
http://openjdk.java.net/jeps/192

Since those are internal JVM things, whether they will make it into the IBM JVM
I of course don't know.

Scott Chapman

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-05 Thread Elardus Engelbrecht
Linda wrote:

>Ummm and I heard it (and used it) as Seriously Outa Luck!

That is the one of the polite versions of "Sh*t Outa Luck"... ;-D


SOL is also (over 200 meanings according to http://www.acronymfinder.com ):

System Off Line

Smile Out Loud
Sadly Outta Luck (polite form)
Stuff Outta Luck (polite form)
Sorta Outta Luck
Sobbing Out Loud  ... wh... sni... sni... ;-)
Swear Out Loud  ... %$#@#@%^&**

Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Paul Gilmartin
On Tue, 5 Sep 2017 02:02:03 +0100, David W Noon wrote:

>On Mon, 4 Sep 2017 17:07:08 -0700, Charles Mills (charl...@mcn.org)
>wrote about "Re: UTF-8 woes on z/OS, a solution - comments invited" (in
><02b401d325da$f0cb4d30$d261e790$@mcn.org>):
>
>> COBOL or Java, but what about the OP's PL/I?
>
>IBM Enterprise PL/I has WIDECHAR(*), which supports UTF-16. It also has
>the UTF8(), UTF8TOCHAR() and UTF8TOWCHAR() built-in functions that
>translate host code page to UTF-8, UTF-8 to host code page, and UTF-8 to
>UTF-16, respectively. These will probably handle UTF-8 translations more
>reliably than IND$FILE does.
>
>The problem is the complexity that was previously hidden is now visibly
>the province of the programmer.
>
Why is there UTF-16?

o It's a variable-length encoding, involving the same complexities as UTF-8.

o It lacks the compactness of UTF-8 in the case of Latin text.

Is it because it's (sort of) an extension of UCS-2?

(What does Java use internally?)

-- gil

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Linda
Ummm and I heard it (and used it) as Seriously Outa Luck!

Linda

Sent from my iPhone

> On Sep 4, 2017, at 1:50 PM, Charles Mills <charl...@mcn.org> wrote:
> 
> Not the way I heard it.
> 
> Charles
> 
> 
> -Original Message-
> From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On 
> Behalf Of Paul Gilmartin
> Sent: Monday, September 4, 2017 1:35 PM
> To: IBM-MAIN@LISTSERV.UA.EDU
> Subject: Re: UTF-8 woes on z/OS, a solution - comments invited
> 
>> On Mon, 4 Sep 2017 13:17:48 -0700, Charles Mills wrote:
>> 
>> ... another vulgar cliché, ... indeed SOL ...
> ???
> Simply Outa Luck?
> 
> --
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Charles Mills
Well there you go, then.

FTP or IND$FILE in binary.

Read in UTF-8 and translate to UTF-16.

Process in UTF-16.

Translate report UTF-16 to UTF-8.

Download in binary.

QED
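
In Python terms, just to make the flow concrete (the file names are invented,
and on z/OS the middle step would be PL/I WIDECHAR or the UTF8TOWCHAR builtin
rather than Python):

with open("input.utf8", "rb") as f:      # arrived via binary FTP/IND$FILE
    utf8_bytes = f.read()

utf16 = utf8_bytes.decode("utf-8").encode("utf-16-be")   # UTF-8 -> UTF-16

# ... build the report while the data is in its 16-bit form ...
report = utf16.decode("utf-16-be")

with open("report.utf8", "wb") as f:     # back to UTF-8, downloaded in binary
    f.write(report.encode("utf-8"))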

Charles


-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of David W Noon
Sent: Monday, September 4, 2017 6:02 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: UTF-8 woes on z/OS, a solution - comments invited

On Mon, 4 Sep 2017 17:07:08 -0700, Charles Mills (charl...@mcn.org) wrote about 
"Re: UTF-8 woes on z/OS, a solution - comments invited" (in
<02b401d325da$f0cb4d30$d261e790$@mcn.org>):

> COBOL or Java, but what about the OP's PL/I?

IBM Enterprise PL/I has WIDECHAR(*), which supports UTF-16. It also has the 
UTF8(), UTF8TOCHAR() and UTF8TOWCHAR() built-in functions that translate host 
code page to UTF-8, UTF-8 to host code page, and UTF-8 to UTF-16, respectively. 
These will probably handle UTF-8 translations more reliably than IND$FILE does.

The problem is the complexity that was previously hidden is now visibly the 
province of the programmer.
--
Regards,

Dave  [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
david.w.n...@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

 

--
For IBM-MAIN subscribe / signoff / archive access instructions, send email to 
lists...@listserv.ua.edu with the message: INFO IBM-MAIN

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread David W Noon
On Mon, 4 Sep 2017 17:07:08 -0700, Charles Mills (charl...@mcn.org)
wrote about "Re: UTF-8 woes on z/OS, a solution - comments invited" (in
<02b401d325da$f0cb4d30$d261e790$@mcn.org>):

> COBOL or Java, but what about the OP's PL/I?

IBM Enterprise PL/I has WIDECHAR(*), which supports UTF-16. It also has
the UTF8(), UTF8TOCHAR() and UTF8TOWCHAR() built-in functions that
translate host code page to UTF-8, UTF-8 to host code page, and UTF-8 to
UTF-16, respectively. These will probably handle UTF-8 translations more
reliably than IND$FILE does.

The problem is the complexity that was previously hidden is now visibly
the province of the programmer.
-- 
Regards,

Dave  [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
david.w.n...@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

 

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Charles Mills
COBOL or Java, but what about the OP's PL/I?

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of Walt Farrell
Sent: Monday, September 4, 2017 4:00 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: UTF-8 woes on z/OS, a solution - comments invited

Have you considered transferring it to z/OS in binary, rather than converting 
to EBCDIC. Then just process it in its UNICODE format, which either Java or 
Enterprise COBOL should be able to handle (Java by default, COBOL with 
appropriate UNICODE specifications).

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Walt Farrell
Have you considered transferring it to z/OS in binary, rather than converting 
to EBCDIC. Then just process it in its UNICODE format, which either Java or 
Enterprise COBOL should be able to handle (Java by default, COBOL with 
appropriate UNICODE specifications).

-- 
Walt

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Charles Mills
Not the way I heard it.

Charles


-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of Paul Gilmartin
Sent: Monday, September 4, 2017 1:35 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: UTF-8 woes on z/OS, a solution - comments invited

On Mon, 4 Sep 2017 13:17:48 -0700, Charles Mills wrote:
>
>... another vulgar cliché, ... indeed SOL ...
>
???
Simply Outa Luck?

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Paul Gilmartin
On Mon, 4 Sep 2017 13:17:48 -0700, Charles Mills wrote:
>
>... another vulgar cliché, ... indeed SOL ...
>
???
Simply Outa Luck?

-- gil

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Charles Mills
After I read @Robert's reply to my note I was mentally composing more or less 
what @Gil writes below.

Paraphrasing the vulgar cliché, you can't put 20 bits of data in an 8-bit byte. 
Ultimately, EBCDIC is what it is, and it ain't UTF-8.

I suppose you might be able to create a custom EBCDIC code page that included 
"your" European characters -- assuming no more than thirty or so, and then 
configure z Unicode Services to handle it. Otherwise, to invoke another vulgar 
cliché, you are indeed SOL (without your homegrown, um, solution).

Charles


-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of Paul Gilmartin
Sent: Monday, September 4, 2017 12:26 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: UTF-8 woes on z/OS, a solution - comments invited

On Mon, 4 Sep 2017 20:59:12 +, Robert Prins wrote
>
>I can probably find a set of code-pages that correctly translate the 
>two byte
>UTF-8 "ü" character to a one byte EBCDIC "ü" character, but how would 
>those same two code-pages translate the Polish "ł", the Danish "ø", the 
>Baltic "ė", and the Greek "Θ", which appear in the same PC-side file to 
>one single character... And back to the correct UTF-8 character...
>
>That makes the problem maybe more understandable?
> 
If SBCS is a requirement, then if there is an EBCDIC SBCS code page that 
contains "ü", "ł", "ø", "ė", and "Θ", iconv can probably translate UTF-8 to 
that code page.  Otherwise, you're SOL.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Robert Prins

On 2017-09-04 19:24, Paul Gilmartin wrote:

On Mon, 4 Sep 2017 20:59:12 +, Robert Prins wrote


I can probably find a set of code-pages that correctly translate the two byte
UTF-8 "ü" character to a one byte EBCDIC "ü" character, but how would those same
two code-pages translate the Polish "ł", the Danish "ø", the Baltic "ė", and the
Greek "Θ", which appear in the same PC-side file to one single character... And
back to the correct UTF-8 character...

That makes the problem maybe more understandable?


If SBCS is a requirement, then if there is an EBCDIC SBCS code page that
contains "ü", "ł", "ø", "ė", and "Θ", iconv can probably translate UTF-8 to that
code page.  Otherwise, you're SOL.


That's why I'm now using the code that I posted. It works, assuming the UTF-8 
data is correct. If that isn't the case, then I'm SOL, and the users get what 
they deserve, GIGO! ;)


Robert
--
Robert AH Prins
robert(a)prino(d)org

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Paul Gilmartin
On Mon, 4 Sep 2017 20:59:12 +, Robert Prins wrote
>
>I can probably find a set of code-pages that correctly translate the two byte
>UTF-8 "ü" character to a one byte EBCDIC "ü" character, but how would those 
>same
>two code-pages translate the Polish "ł", the Danish "ø", the Baltic "ė", and 
>the
>Greek "Θ", which appear in the same PC-side file to one single character... And
>back to the correct UTF-8 character...
>
>That makes the problem maybe more understandable?
> 
If SBCS is a requirement, then if there is an EBCDIC SBCS code page that
contains "ü", "ł", "ø", "ė", and "Θ", iconv can probably translate UTF-8 to that
code page.  Otherwise, you're SOL.

See:  https://en.wikipedia.org/wiki/Pigeonhole_principle
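
Python's cp037 codec makes the point quickly (CP037 standing in here for
whatever EBCDIC SBCS page iconv would target; it covers the Latin-1 repertoire,
so two of the five characters survive and the rest do not fit):

for ch in "üłøėΘ":
    try:
        print(ch, ch.encode("cp037").hex())
    except UnicodeEncodeError:
        print(ch, "-- no code point in CP037")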

-- gil

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Robert Prins

On 2017-09-04 17:55, Charles Mills wrote:

I don't understand the problem.


That's correct.


Yes, ü is two bytes (not characters as you wrote!) in UTF-8.


You're correct again.

But if the translation is working correctly and the code page is specified 
correctly it should become one byte in EBCDIC, and assuming the report 
program treats it as a literal of some sort -- does not expect to deduce 
meaning from each byte -- it should be perfectly happy with S?d (pretending
? is an EBCDIC ü) as a district or whatever name. The report columns should
be correct, and it should come back to UTF-8 land as ü, with the proper
number of padding blanks.



It sounds like you are incorrectly translating ü to *two* EBCDIC characters,
and that is the root of your problem. See if you can't translate to an
EBCDIC code page that includes ü.


I can probably find a set of code-pages that correctly translate the two byte
UTF-8 "ü" character to a one byte EBCDIC "ü" character, but how would those same
two code-pages translate the Polish "ł", the Danish "ø", the Baltic "ė", and the 
Greek "Θ", which appear in the same PC-side file to one single character... And 
back to the correct UTF-8 character...


That makes the problem maybe more understandable?

Robert


Charles


-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Robert Prins
Sent: Monday, September 4, 2017 12:34 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: UTF-8 woes on z/OS, a solution - comments invited


OK, I solved the problem, but maybe someone here can come up with something
a bit more efficient...

There is a file in the non-z/OS world, that used to be pure ASCII (actually 
CP437/850), but that has now been converted to UTF-8, due to further 
internationalisation requirements. Said file was uploaded to z/OS, processed 
into a set of datasets containing various reports, and those reports were 
later downloaded to the non-z/OS world, using the same process that was used 
to upload them, which could be one of two, IND$FILE, or FTP.


Both FTP and IND$FILE uploads had (and still have) no problems with 
CP437/850/UTF-8 data, and although an ü might not have displayed as such on 
z/OS, it would have transferred back to the same ü. However, an ü in UTF-8 
now consists of two characters, and that means that, replacing spaces with 
'=' characters, the original


|=Süd|
|=Nord===|

report lines now come out as

|=Süd===|
|=Nord===|

when opened in the non z/OS world with an UTF-8 aware application.

--
For IBM-MAIN subscribe / signoff / archive access instructions, send email to
lists...@listserv.ua.edu with the message: INFO IBM-MAIN





--
Robert AH Prins
robert(a)prino(d)org

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Charles Mills
I don't understand the problem. 

Yes, ü is two bytes (not characters as you wrote!) in UTF-8. But if the 
translation is working correctly and the code page is specified correctly it 
should become one byte in EBCDIC, and assuming the report program treats it as 
a literal of some sort -- does not expect to deduce meaning from each byte -- 
it should be perfectly happy with S?d (pretending ? is an EBCDIC ü) as a 
district or whatever name. The report columns should be correct, and it should 
come back to UTF-8 land as ü, with the proper number of padding blanks.

It sounds like you are incorrectly translating ü to *two* EBCDIC characters, 
and that is the root of your problem. See if you can't translate to an EBCDIC 
code page that includes ü.

Charles


-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of Robert Prins
Sent: Monday, September 4, 2017 12:34 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: UTF-8 woes on z/OS, a solution - comments invited

OK, I solved the problem, but maybe someone here can come up with something a 
bit more efficient...

There is a file in the non-z/OS world, that used to be pure ASCII (actually 
CP437/850), but that has now been converted to UTF-8, due to further 
internationalisation requirements. Said file was uploaded to z/OS, processed 
into a set of datasets containing various reports, and those reports were later 
downloaded to the non-z/OS world, using the same process that was used to 
upload them, which could be one of two, IND$FILE, or FTP.

Both FTP and IND$FILE uploads had (and still have) no problems with
CP437/850/UTF-8 data, and although an ü might not have displayed as such on 
z/OS, it would have transferred back to the same ü. However, an ü in UTF-8 now 
consists of two characters, and that means that, replacing spaces with '=' 
characters, the original

|=Süd|
|=Nord===|

report lines now come out as

|=Süd===|
|=Nord===|

when opened in the non z/OS world with an UTF-8 aware application.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


UTF-8 woes on z/OS, a solution - comments invited

2017-09-04 Thread Robert Prins
OK, I solved the problem, but maybe someone here can come up with something a 
bit more efficient...


There is a file in the non-z/OS world, that used to be pure ASCII (actually 
CP437/850), but that has now been converted to UTF-8, due to further 
internationalisation requirements. Said file was uploaded to z/OS, processed 
into a set of datasets containing various reports, and those reports were later 
downloaded to the non-z/OS world, using the same process that was used to upload 
them, which could be one of two, IND$FILE, or FTP.


Both FTP and IND$FILE uploads had (and still have) no problems with 
CP437/850/UTF-8 data, and although an ü might not have displayed as such on 
z/OS, it would have transferred back to the same ü. However, an ü in UTF-8 now 
consists of two characters, and that means that, replacing spaces with '=' 
characters, the original


|=Süd|
|=Nord===|

report lines now come out as

|=Süd===|
|=Nord===|

when opened in the non z/OS world with an UTF-8 aware application.

Given that (and in this case I was lucky) the PC file had the option to add
comment-type lines, I solved the problem (the z/OS dataset is processed with
PL/I) by adding an extra line to the input file: the required comment
delimiter, followed by "ASCII ", followed by the 240 ASCII characters from '20'x
to 'ff'x. The PL/I program uses this "special meta-data comment" to transform
the input data (which IND$FILE/FTP has translated to EBCDIC) back into a form
where all UTF-8 initial characters become '1' and all UTF-8 follow-on bytes
become '0', i.e.


dcl ascii char (240); /* containing the 240 characters from '20'x to 'ff'x, read 
in via an additional comment record in the original non-z/OS file */

dcl utf8  char (240) init (('' ||
'' ||
'' ||
'' ||
'' ||
'0011' ||
'1000'));

and to get the number of UTF-8 displayable characters of, e.g. myvar, a char(47) 
variable, I use the following


dcl a47(47) pic '9';
dcl more char (20) var;

string(a47) = translate(myvar, utf8, ascii);
more = copy(' ', 47 - sum(a47));

where "more" holds the extra blanks that need to be added to the report column
to ensure that the columns line up again in the non-z/OS UTF-8 world. The
(relative) beauty of this approach lies in the fact that the technique is
completely code-page independent, and could even be used with the PL/I compiler
on Windows.


The above works like a charm, however, both translate() and sum(), especially of 
pic '9' data, are not exactly the most efficient functions, so the question is, 
can anyone think of a more efficient way, other than the quick(?) and dirty 
solution of using a macro on the non-z/OS side, to set "more" to the required
number of characters. I'm open to a PL/I callable assembler routine, but the 
process must be, like the one above, completely code-page independent!
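
For comparison, the same counting rule expressed directly on the UTF-8 bytes
(Python, so only a sketch of the idea rather than a replacement for the PL/I,
and with the same caveat about combining characters):

def utf8_display_length(raw):
    # Every byte except a continuation byte (10xxxxxx) starts a character.
    return sum(1 for b in raw if (b & 0xC0) != 0x80)

def pad_to(raw, width):
    # Blanks needed = field width minus the count of character-starting bytes.
    return raw + b" " * max(0, width - utf8_display_length(raw))

line = "=Süd".encode("utf-8")            # 5 bytes, 4 display columns
print(pad_to(line, 10) + b"|")           # 10 display columns, then the bar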


Robert
--
Robert AH Prins
robert.ah.prins(a)gmail.com

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN