Re: Subject Unicode
In CAArMM9T5iAWomwY=mpt5lazdbz7xaz0h6b0nhyjws0ymc0o...@mail.gmail.com, on 01/13/2014 at 02:27 PM, Tony Harminc t...@harminc.net said:

> But no one would say that UTF-8 *is* ASCII, or that UTF-EBCDIC *is* EBCDIC.

Well, all ASCII characters are valid single-octet UTF-8 sequences, so I would say that ASCII is a subset of UTF-8. As for EBCDIC, there were already multiple EBCDIC code pages prior to Unicode, so there would seem to be a case for calling UTF-EBCDIC as much EBCDIC as the others. Does the IBM documentation take a position on that?

--
Shmuel (Seymour J.) Metz, SysProg and JOAT
ISO position; see http://patriot.net/~shmuel/resume/brief.html
We don't care. We don't have to care, we're Congress. (S877: The Shut up and Eat Your spam act of 2003)

--
For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
There *are* general ways to convert Unicode into EBCDIC. IBM z/OS Unicode Services implements several of them. Yes, a Unicode file potentially (but not necessarily) includes characters not found in a particular EBCDIC code page. Traditionally, they are converted to EBCDIC SUB, X'3F'. Assuming you refer to SBCS EBCDIC, the conversion results are likely to be unsatisfying if the Unicode file is, as is likely, rich in characters with no EBCDIC equivalent. OTOH EBCDIC DBCS includes a very large subset of common Unicode characters.

Charles

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Tony Harminc
Sent: Monday, January 13, 2014 2:27 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Subject Unicode

On 12 January 2014 10:21, Shmuel Metz (Seymour J.) shmuel+ibm-m...@patriot.net wrote:
> on 01/09/2014 at 09:00 PM, Tony Harminc t...@harminc.net said:
>> There is no general way to convert UNICODE into EBCDIC,
> There are EBCDIC transforms for Unicode. I'm not sure whether that qualifies as EBCDIC.

Exactly as much as UTF-8 qualifies as ASCII, that is to say, not at all. In both cases (UTF-8 and UTF-EBCDIC), there are several characteristics of the encoded result that are convenient in the respective environments. In particular, for legacy applications, the most often used characters in single-byte ASCII/EBCDIC are encoded by the same byte value in UTF-xxx. But no one would say that UTF-8 *is* ASCII, or that UTF-EBCDIC *is* EBCDIC.

Tony H.
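The SUB-substitution behavior Charles describes can be sketched in Python (a sketch only, not z/OS Unicode Services; the `ebcdic-sub` handler name is made up here), using Python's cp037 codec as a stand-in EBCDIC code page:

```python
import codecs

def ebcdic_sub(err):
    # Replace each unmappable character with SUB (U+001A),
    # which cp037 encodes as the EBCDIC SUB byte X'3F'.
    return ('\x1a', err.end)

codecs.register_error('ebcdic-sub', ebcdic_sub)  # handler name is arbitrary

text = 'Invoice total: €100'   # the Euro sign has no cp037 code point
data = text.encode('cp037', errors='ebcdic-sub')
assert b'\x3f' in data         # the Euro became EBCDIC SUB
```

For the richer repertoires mentioned above, the same handler works with any EBCDIC codec Python ships (cp500, cp1140, etc.).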
Re: Subject Unicode (Also email. Also TAB.)
In 7503442349556875.wa.paulgboulderaim@listserv.ua.edu, on 01/12/2014 at 09:55 AM, Paul Gilmartin paulgboul...@aim.com said:

> Thereby sacrificing some small economy of storage. There are even better arguments for deferring the disambiguation, such as:
> o Use of tabs as field separators in exported data bases.
> o Rendering in proportional-spaced fonts, particularly when the choice of font is left to the viewer.

o Use of HT to represent HT for applications that treat HT as an HT, e.g., EDIT, SCRIPT.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode (Also email. Also TAB.)
In 52d2d540.1020...@t-online.de, on 01/12/2014 at 06:47 PM, Bernd Oppolzer bernd.oppol...@t-online.de said:

> IMO, the idea to put tab characters into files is wrong from the beginning.

I don't agree; it's useful for text markup. I don't like taking away a printable character as a logical tab.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In 20140111220658.62ce18f...@panix3.panix.com, on 01/11/2014 at 05:06 PM, Don Poitras poit...@pobox.com said:

> I don't know how these characters are going to survive email,

Not without proper[1] MIME header fields; characters like, e.g., Copyright (©), Euro (€), Registered (®), Yen (¥), are not ASCII.

[1] E.g.,
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 8bit

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
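The headers in the footnote can be produced from Python's standard library; a minimal sketch (the message text is invented for illustration):

```python
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Unicode survives with proper MIME headers'
# Declaring the charset is what lets non-ASCII characters like the
# Euro sign travel intact; the library fills in Content-Type and a
# suitable Content-Transfer-Encoding.
msg.set_content('Price: €10 (also © ® ¥)', charset='iso-8859-15')

print(msg['Content-Type'])   # e.g. text/plain; charset="iso-8859-15"
```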
Re: Subject Unicode
In 8160871980876269.wa.paulgboulderaim@listserv.ua.edu, on 01/12/2014 at 03:28 PM, Paul Gilmartin paulgboul...@aim.com said:

> Doesn't understand UNIX line breaks.

I don't FTP text files as binary. NOTEPAD doesn't introduce fancy formatting that I didn't request and don't want. For me, that makes it superior to WORDPAD.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In CAE1XxDF7qr2ek3mdCFRsgdqUjpReOCmCs5qqfckwMY7sh=t...@mail.gmail.com, on 01/12/2014 at 05:11 PM, John Gilmore jwgli...@gmail.com said:

> If I argued that the comments prefixed to a routine described its putative algorithm correctly and that the routine itself could thus contain no error, Shmuel would still hopefully be quick to point out the inadequacy of my argument; but here he is guilty of the same sort of cocksure silliness.

Nonsense; you are conflating a formal specification with a body of code purporting to implement that specification. If your real complaint is that there is code in the wild that does not correctly implement the specifications, then be honest enough to say so instead of playing word games.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
On Sun, Jan 12, 2014 at 5:28 PM, Paul Gilmartin paulgboul...@aim.com wrote:
> On Sun, 12 Jan 2014 15:48:49 -0600, Kirk Wolf wrote:
>> On Linux gedit works fine, on Windows I use Notepad++ which handles Unix eols and UTF-8
> You mean I don't have to wait for Windows 14!? Thanks! Does it do UNIX eols on input *and* output? Wordpad only does the former.

Yes. You can switch in the current document, or you can set the default for new documents.
Re: Subject Unicode (Also email. Also TAB.)
On Sun, 12 Jan 2014 13:09:40 -0500, Shmuel Metz (Seymour J.) wrote:
>> Thereby sacrificing some small economy of storage. There are even better arguments for deferring the disambiguation, such as:
>> o Use of tabs as field separators in exported data bases.
>> o Rendering in proportional-spaced fonts, particularly when the choice of font is left to the viewer.
> o Use of HT to represent HT for applications that treat HT as an HT, e.g., EDIT, SCRIPT.

A considerable refutation of the argument against retaining tabs in files.

On Mon, 13 Jan 2014 07:51:33 -0500, Shmuel Metz (Seymour J.) wrote:
>> [Notepad] Doesn't understand UNIX line breaks.
> I don't FTP text files as binary. NOTEPAD doesn't introduce fancy formatting that I didn't request and don't want. For me, that makes it superior to WORDPAD.

Rather than FTPing hither and yon, I share many of my files with NFS and Samba among UNIX, z/OS, and Windows. This argues for an eclectic editor.

-- gil
Re: Subject Unicode
On 12 January 2014 10:21, Shmuel Metz (Seymour J.) shmuel+ibm-m...@patriot.net wrote:
> on 01/09/2014 at 09:00 PM, Tony Harminc t...@harminc.net said:
>> There is no general way to convert UNICODE into EBCDIC,
> There are EBCDIC transforms for Unicode. I'm not sure whether that qualifies as EBCDIC.

Exactly as much as UTF-8 qualifies as ASCII, that is to say, not at all. In both cases (UTF-8 and UTF-EBCDIC), there are several characteristics of the encoded result that are convenient in the respective environments. In particular, for legacy applications, the most often used characters in single-byte ASCII/EBCDIC are encoded by the same byte value in UTF-xxx. But no one would say that UTF-8 *is* ASCII, or that UTF-EBCDIC *is* EBCDIC.

Tony H.
Re: Subject Unicode
On Mon, Jan 13, 2014 at 1:27 PM, Tony Harminc t...@harminc.net wrote:
> On 12 January 2014 10:21, Shmuel Metz (Seymour J.) shmuel+ibm-m...@patriot.net wrote:
>> on 01/09/2014 at 09:00 PM, Tony Harminc t...@harminc.net said:
>>> There is no general way to convert UNICODE into EBCDIC,
>> There are EBCDIC transforms for Unicode. I'm not sure whether that qualifies as EBCDIC.
> Exactly as much as UTF-8 qualifies as ASCII, that is to say, not at all. In both cases (UTF-8 and UTF-EBCDIC), there are several characteristics of the encoded result that are convenient in the respective environments. In particular, for legacy applications, the most often used characters in single-byte ASCII/EBCDIC are encoded by the same byte value in UTF-xxx. But no one would say that UTF-8 *is* ASCII, or that UTF-EBCDIC *is* EBCDIC.

As a former US president famously said, it depends on what the meaning of the word 'is' is :-) It would be perfectly reasonable to say that UTF-8 is a superset of ASCII. That was its design: the lower 128 code points are ASCII (7-bit).

Kirk Wolf
Dovetailed Technologies
http://dovetail.com
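The superset claim is easy to check mechanically; every 7-bit ASCII code point encodes to the identical single byte in UTF-8, while anything beyond ASCII takes more than one byte. A quick sketch:

```python
# Every ASCII code point (0-127) is the same single byte in UTF-8 ...
for i in range(128):
    assert chr(i).encode('utf-8') == bytes([i])

# ... while anything beyond ASCII needs multiple bytes.
assert len('é'.encode('utf-8')) == 2
assert len('€'.encode('utf-8')) == 3
```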
Re: Subject Unicode
It might survive as a .txt attachment. Everything else gets sliced and diced.

In a message dated 1/11/2014 4:15:43 P.M. Central Standard Time, poit...@pobox.com writes:
> Yeah, I didn't think that would work. :) If you're reading this as I am, all (well, most of) the text below ended up as ??.
Re: Subject Unicode (Also email. Also TAB.)
On Sun, 12 Jan 2014 03:48:45 -0500, Ed Finnell wrote:
> It might survive as .txt attachment. Everything else gets sliced and diced.

Depends on the MUA. The text I submitted earlier by email:

== Polyglot ==
A common Russian phrase is ОЧЕНЬ ХОРОШО.
The Greek might be ΠΟΛΥ ΚΑΛΑ.
...

... made the round trip intact. I believe it's also preserved by the web interface. We'll see now.

On Sun, 12 Jan 2014 15:53:16 +0800, Timothy Sipples wrote:
> There's a tab symbol glyph at Unicode point U+21E5. It's a glyph consisting of a rightwards arrow to a bar. Many keyboards with a Tab key include this symbol as part of the key label. More information here: https://en.wikipedia.org/wiki/Arrow_(symbol)

That might be RIGHTWARDS ARROW TO BAR ⇥.

-- gil
Re: Subject Unicode (Also email. Also TAB.)
On the several keyboards I have at hand tab is modal, right or left depending upon the current shift-key setting. The modal marking appears to be

| tab |
| ——  |
| ——  |

in which the 'arrowheads' are solid, not open. I should think that '|' would be adequately perspicuous.

The notorious ambiguity of tabs remains. Their effects depend upon local tab settings, and many implementations disambiguate them by replacing them with blanks of currently equivalent effect in saved/stored files.

John Gilmore, Ashland, MA 01721 - USA
Re: Subject Unicode (Also email. Also TAB.)
On Sun, 12 Jan 2014 10:29:22 -0500, John Gilmore wrote:
> ... [Tabs'] effects depend upon local tab settings, and many implementations disambiguate them by replacing them with blanks of currently equivalent effect in saved/stored files.

Thereby sacrificing some small economy of storage. There are even better arguments for deferring the disambiguation, such as:
o Use of tabs as field separators in exported data bases.
o Rendering in proportional-spaced fonts, particularly when the choice of font is left to the viewer.

-- gil
Re: Subject Unicode (Also email. Also TAB.)
IMO, the idea to put tab characters into files is wrong from the beginning. But of course it comes from the paper tape paradigm, where a file is historically a paper tape feeding a teletype machine. With normal local typewriters, a tab is nothing other than a command to the typewriter to move the carriage to a certain position, and that's how it should be implemented in more record-oriented environments.

That said, I would like it if the editors I use replaced all tabs with blanks when storing the files, so that there are never any tab characters inside the files, because when reading them you have the problem of deciding what tab positions the file is meant to have; you always have to guess, it's wrong most of the time, and the result looks awful.

Regards tabs separating fields in external representations of database records: there are other possibilities. Commas and semicolons are not nice, but they work, given proper handling of the text fields (and: if the text fields really contain text and no binary data).

In total: I hate tabs and try to avoid them wherever I can.

Kind regards
Bernd

Am 12.01.2014 16:55, schrieb Paul Gilmartin:
> On Sun, 12 Jan 2014 10:29:22 -0500, John Gilmore wrote:
>> ... [Tabs'] effects depend upon local tab settings, and many implementations disambiguate them by replacing them with blanks of currently equivalent effect in saved/stored files.
> Thereby sacrificing some small economy of storage. There are even better arguments for deferring the disambiguation, such as:
> o Use of tabs as field separators in exported data bases.
> o Rendering in proportional-spaced fonts, particularly when the choice of font is left to the viewer.
>
> -- gil
Re: Subject Unicode
In 1389314155.47172.yahoomail...@web126205.mail.ne1.yahoo.com, on 01/09/2014 at 04:35 PM, Scott Ford scott_j_f...@yahoo.com said:

> PC ( data using a foreign language Unicode page

What are you trying to say? If the PC is using Unicode then it will transmit data as UTF-7 or UTF-8, which covers the entire BMP and beyond. Are you really asking about translating between ISO-8859 code pages and Unicode?

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In 022c01cf0da5$a7b25180$f716f480$@mcn.org, on 01/09/2014 at 05:45 PM, Charles Mills charl...@mcn.org said:

> There are several flavors of Unicode, but they relate to how the code points are stored in a file or transmitted, not to the character set.

Actually, those are transforms rather than different flavors of Unicode. Unicode does come in distinct numbered versions, but AFAIK a code point defined in an older version will always be present in the more recent versions.

> (someone will no doubt correct me with the exact number in use)

That would be a moving target; Unicode does not currently assign all code points in the BMP, much less the full 21-bit range.

> and you could make the first part of the character set the same as ASCII, which would make it intuitive for PC folks who know that A is X'41'. That is called UTF-8,

UTF-8 uses non-ASCII byte values to represent code points higher than 127; UTF-7 uses only ASCII characters. I hope he's not using UTF-7.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In 0e75a300-f7c5-46a7-a1d3-7189d2a58...@yahoo.com, on 01/09/2014 at 08:39 PM, Scott Ford scott_j_f...@yahoo.com said:

> We send a data message from a pc, we encrypt it with AES128, the message is received at the host (z/OS), decrypted, then converted from ascii to ebcdic

If it really was ASCII then it would be cut and dried. If it's anything else then you need to know what it is in order to convert.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In 20140110034419.c71008f...@panix3.panix.com, on 01/09/2014 at 10:44 PM, Don Poitras poit...@pobox.com said:

> As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC.

Only if the PC was using UTF-8 or translates to Unicode with UTF-8 as part of the transmission.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In CAArMM9QOFq1jtzwmj=LWHTKkadMzx=aaqppbbqjnk+c8kuz...@mail.gmail.com, on 01/09/2014 at 09:00 PM, Tony Harminc t...@harminc.net said:

> There is no general way to convert UNICODE into EBCDIC,

There are EBCDIC transforms for Unicode. I'm not sure whether that qualifies as EBCDIC.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In of065337e4.0e9ce2ec-on48257c5c.0027bfa8-48257c5c.00297...@sg.ibm.com, on 01/10/2014 at 03:30 PM, Timothy Sipples sipp...@sg.ibm.com said:

> Somehow I'm reminded of the save two characters impulse which then caused a lot of angst in preparing for Y2K.

The situations are not comparable. With 2-digit years there was an actual truncation of the data. With UTF-7 or UTF-8, all of the data are still present, but there is an efficiency issue.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In bay177-w36d24c9a61f3b6464b4992f2...@phx.gbl, on 01/10/2014 at 09:36 AM, Harry Wahl harry_w...@hotmail.com said:

> You could use the BOM UTF characters

There are none. U+FEFF ZERO WIDTH NO-BREAK SPACE is a Unicode character.

> usually inserted transparently at the beginning of a UTF file.

Usually inserted *only* at the beginning of a file transmitted as UCS-2, UCS-4, UTF-16 or UTF-32. Also, see the restrictions in RFC 3629 (STD 63), UTF-8, a transformation format of ISO 10646, section 6, Byte order mark (BOM).

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
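The distinction drawn here (BOM meaningful for UTF-16/32, discouraged for UTF-8) can be observed from Python's codecs; a small sketch:

```python
import codecs

# UTF-16 with unspecified byte order gets a BOM prepended automatically.
data = 'A'.encode('utf-16')
assert data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# With an explicit byte order there is no BOM to write.
assert 'A'.encode('utf-16-le') == b'A\x00'

# In UTF-8 a "BOM" is just the ordinary encoding of U+FEFF
# (ZERO WIDTH NO-BREAK SPACE); RFC 3629 section 6 discourages it,
# since UTF-8 has no byte-order ambiguity to resolve.
assert '\ufeff'.encode('utf-8') == codecs.BOM_UTF8 == b'\xef\xbb\xbf'
```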
Re: Subject Unicode
In 9931357931112854.wa.paulgboulderaim@listserv.ua.edu, on 01/10/2014 at 08:41 AM, Paul Gilmartin paulgboul...@aim.com said:

> Notepad? What's that? Perhaps some obsolete predecessor of Wordpad?

No, it's a superior version of wordpad. HTH.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In CAE1XxDHn0wgwJpm+cNLSdzv=ccvoz1u5o6em7xwxnqs4u0z...@mail.gmail.com, on 01/10/2014 at 09:50 AM, John Gilmore jwgli...@gmail.com said:

> As soon, however, as you need to support
> o three or more different roman-alphabet natural languages, or
> o a roman-alphabet language and a non-alphabetic Asian language
> you need UTF-16.

Nonsense; it's strictly an efficiency issue, and depends on the relative frequencies with which you use various characters and the degree to which you do locale-dependent functions.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In CAE1XxDHBJRiuFH7N-031xdJ3DUnO6QyG-=otde8fvnc-uyv...@mail.gmail.com, on 01/10/2014 at 11:02 AM, John Gilmore jwgli...@gmail.com said:

> The problem is not one of representability but of subset choice.

There is no problem of subset choice, because use of UTF-8 does not imply a proper subset of Unicode; it is a transform for every code point from U+0000 through U+10FFFF.[1]

[1] Except U+D800 through U+DFFF, which are not valid.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
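The footnote's exclusion is enforced by conforming encoders; a quick Python check (sketch):

```python
# UTF-8 spans the full range U+0000..U+10FFFF ...
assert '\U0010FFFF'.encode('utf-8') == b'\xf4\x8f\xbf\xbf'

# ... except the surrogate range U+D800..U+DFFF, which RFC 3629 forbids;
# a conforming encoder must reject a lone surrogate.
try:
    '\ud800'.encode('utf-8')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```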
Re: Subject Unicode
In cae1xxdgwscr+ffp13_rperg4jmkferdgp4f6sxtz7v48o4g...@mail.gmail.com, on 01/10/2014 at 01:28 PM, John Gilmore jwgli...@gmail.com said:

> Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist.

What are you drinking? RFC 3629 spells them out in excruciating detail.

> In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version.

That is simply an efficiency issue; "you need UTF-16" is a much stronger claim than "UTF-16 may be more efficient". Further, a sample size of one is grossly inadequate for drawing statistical conclusions. Try documents that are mostly English, French and German with a smattering of CJK languages and you will get different results.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
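The relative-frequency point is easy to demonstrate: which transform is smaller depends entirely on the script mix. A sketch with made-up sample strings:

```python
latin = 'Mostly ASCII text is compact in UTF-8.'   # 1 byte/char in UTF-8
cjk = '日本語のテキスト'   # BMP CJK: 3 bytes each in UTF-8, 2 in UTF-16

# Latin-heavy text favors UTF-8 ...
assert len(latin.encode('utf-8')) < len(latin.encode('utf-16-le'))

# ... while CJK-heavy text favors UTF-16.
assert len(cjk.encode('utf-8')) > len(cjk.encode('utf-16-le'))
```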
Re: Subject Unicode
In 20140110195944.3d5f333...@panix2.panix.com, on 01/10/2014 at 02:59 PM, Don Poitras poit...@pobox.com said:

> As far as 3270 goes, I think it's just going to use the CODEPAGE and CHARSET you start ISPF with.

I think it's going to be limited to the set of EBCDIC code pages.

> As this is the first release, I'm sure there's stuff missing that will be added as time goes by.

Proper support would require enhancing the 3270 display stream.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode (Also email. Also TAB.)
Two short additions:

First: Regards in the 4th paragraph is a sort of typo; it should read Regarding.

Second: from the moment we stopped exchanging files by paper tape, we should have stopped putting tabs into files - if not before. My opinion ...

Kind regards
Bernd

Am 12.01.2014 18:47, schrieb Bernd Oppolzer:
> IMO, the idea to put tab characters into files is wrong from the beginning. But of course it comes from the paper tape paradigm, where a file is historically a paper tape feeding a teletype machine. With normal local typewriters, a tab is nothing other than a command to the typewriter to move the carriage to a certain position, and that's how it should be implemented in more record-oriented environments.
>
> That said, I would like it if the editors I use replaced all tabs with blanks when storing the files, so that there are never any tab characters inside the files, because when reading them you have the problem of deciding what tab positions the file is meant to have; you always have to guess, it's wrong most of the time, and the result looks awful.
>
> Regards tabs separating fields in external representations of database records: there are other possibilities. Commas and semicolons are not nice, but they work, given proper handling of the text fields (and: if the text fields really contain text and no binary data).
>
> In total: I hate tabs and try to avoid them wherever I can.
>
> Kind regards
> Bernd
Re: Subject Unicode (Also email. Also TAB.)
> you have the problem to decide what tab positions this file is meant to have, and you always have to guess, and it's wrong most of the time, and the result looks awful

Your solution would also look awful with proportional text.

- Ted MacNEIL
eamacn...@yahoo.ca
Twitter: @TedMacNEIL
Re: Subject Unicode (Also email. Also TAB.)
Am 12.01.2014 19:10, schrieb Ted MacNEIL:
>> you have the problem to decide what tab positions this file is meant to have, and you always have to guess, and it's wrong most of the time, and the result looks awful
> Your solution would also look awful with proportional text.

My focus is on source code most of the time; there I am lost with proportional fonts, anyway.
Re: Subject Unicode (Also email. Also TAB.)
On Sun, 12 Jan 2014 18:59:25 +0100, Bernd Oppolzer wrote:
> second: from the moment we stopped exchanging files by paper tape, we should have stopped putting tabs into files - if not before. My opinion ...

Why? Where else would you keep them?

> Regards tabs separating fields in external representations of database records: there are other possibilities. Commas and semicolons are not nice, but it works,

More seriously, if the data fields legitimately contain commas and/or semicolons, tab is a more useful field separator. If the data contain tabs? Well, that's a good place to apply your argument for avoiding tabs. Legibility? That goes directly back to the OP's question.

-- gil
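The field-separator point, sketched with Python's csv module: with a tab delimiter, commas and semicolons pass through unquoted, exactly as argued above (the sample row is invented):

```python
import csv
import io

row = ['Smith, John', 'notes; see below', 'plain']

buf = io.StringIO()
csv.writer(buf, delimiter='\t').writerow(row)
line = buf.getvalue()

# Commas and semicolons need no quoting when tab is the delimiter ...
assert line.rstrip('\r\n').split('\t') == row
# ... though a tab *inside* a field would force quoting, which is
# the place to apply the argument for avoiding tabs in data.
```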
Re: Subject Unicode
On Sun, 12 Jan 2014 10:45:23 -0500, Shmuel Metz (Seymour J.) wrote:
>> Notepad? What's that? Perhaps some obsolete predecessor of Wordpad?
> No, it's a superior version of wordpad. HTH.

Doesn't understand UNIX line breaks. For me that's a deal breaker.

-- gil
Re: Subject Unicode
On Linux gedit works fine; on Windows I use Notepad++, which handles Unix eols and UTF-8.

Kirk Wolf
Dovetailed Technologies
http://dovetail.com

On Sun, Jan 12, 2014 at 3:28 PM, Paul Gilmartin paulgboul...@aim.com wrote:
> On Sun, 12 Jan 2014 10:45:23 -0500, Shmuel Metz (Seymour J.) wrote:
>>> Notepad? What's that? Perhaps some obsolete predecessor of Wordpad?
>> No, it's a superior version of wordpad. HTH.
> Doesn't understand UNIX line breaks. For me that's a deal breaker.
>
> -- gil
Re: Subject Unicode (Also email. Also TAB.)
Tabs are useful for formatting input text. I use tab settings of 10, 16, 35, and 72 for HLASM source formatting; but I will not use a text editor that does not---optionally, for those who have other preferences---replace tabs with blanks during save/storage operations. Bernd and I are thus in complete agreement about this issue.

More generally, the notion of usurping the traditional function of a character to use it for another purpose where it is safe to do so is, I think, a dubious, even irresponsible one. Doing so always has untoward, sometimes tragic consequences. I am sure, for example, that the people who 'extended' C on the cheap to support strings of conceptually unlimited length, with EOS delimited by a nul, x'00' in an SBCS or x'0000' in a DBCS, thought their idea was benign. In the event they sowed badly, and we are all reaping the whirlwind.

John Gilmore, Ashland, MA 01721 - USA
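The NUL-delimiter hazard alluded to can be illustrated in Python, where bytes objects are length-counted (a sketch; the truncation below mimics what C's strlen-style handling does, and the sample data is invented):

```python
payload = b'user\x00admin'   # an embedded NUL, e.g. from hostile input

# A length-counted string keeps all ten bytes ...
assert len(payload) == 10

# ... but NUL-terminated handling silently stops at the first X'00',
# the classic source of truncation and validation-bypass bugs.
c_view = payload.split(b'\x00', 1)[0]
assert c_view == b'user'
```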
Re: Subject Unicode
BTW, Notepad++ is not only free/open source, but it also has the goal of preventing Global Warming :-) http://notepad-plus-plus.org/ .. while at the same time likes to show off: http://notepad-plus-plus.org/features/column-mode-editing.html Kirk Wolf Dovetailed Technologies http://dovetail.com On Sun, Jan 12, 2014 at 3:48 PM, Kirk Wolf k...@dovetail.com wrote: On Linux gedit works fine, on Windows I use Notepad++ which handles Unix eols and UTF-8 Kirk Wolf Dovetailed Technologies http://dovetail.com On Sun, Jan 12, 2014 at 3:28 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Sun, 12 Jan 2014 10:45:23 -0500, Shmuel Metz (Seymour J.) wrote: Notepad? What's that? Perhaps some obsolete predecessor of Wordpad? No, it's a superior version of wordpad. HTH. Doesn't understand UNIX line breaks. For me that's a deal breaker. -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
I don't generally respond to Shmuel's animadversions. This time, however, he has crossed the line of civilized behavior. His experience with Unicode appears to be limited to attentive reading of its defining documents. Its implementations are, unsurprisingly, imperfect. In particular the several UTF-8 implementations with which I am more familiar than I should wish to be are very imperfect indeed. If I argued that the comments prefixed to a routine described its putative algorithm correctly and that the routine itself could thus contain no error, Shmuel would still hopefully be quick to point out the inadequacy of my argument; but here he is guilty of the same sort of cocksure silliness. He wondered what I had been drinking. I wonder if he is not suffering from senile dementia. He is certainly exhibiting some of its characteristic symptoms. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On Sun, 12 Jan 2014 15:48:49 -0600, Kirk Wolf wrote: On Linux gedit works fine, on Windows I use Notepad++ which handles Unix eols and UTF-8 You mean I don't have to wait for Windows 14!? Thanks! Does it do UNIX eols on input *and* output? Wordpad only does the former. Thanks again, gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Some words about editors, tabs, eolchar, eofchar. The editor which I like most does the following:
- read files that have CRLF eols or LF eols
- output files with CRLF or LF, controlled by an editor setting
- output an EOF char, if desired (0x1a); most of the time, I don't want it
- allow individual tab settings (of course), either by specifying the tab positions or by specifying an increment
- translate tabs to spaces during reading, if desired
- translate tabs to spaces during writing, if desired
- translate spaces to tabs during writing, if desired (well, I wouldn't use that)
- omit trailing blanks, if desired
- or: set all records to a fixed length, controlled by editor option (filling with blanks or truncating, if necessary)
Most (simple) editors don't have such features, but if you use the editor to prepare for example data files for input and the programs you have rely on such specific details of the input files, you are happy if you have an editor at hand that allows you to do the necessary modifications. Another remark regarding tabs in (source) files: they have no meaning to the compilers etc.; the compilers treat them like spaces in the best case. So the only reason for having them in the source is for formatting purposes, and that - as we pointed out already - does not work, because the tab settings at the time of the creation of the file are not known, so you will get garbage (for the human reader) in the general case. That's why we should IMO avoid tabs in source files. Sometimes I get C programs which I have to port to z/OS, for example; one of the first steps is: remove the tabs in the sources, restore the indentation of the source and limit the source line length to 72 or 80. At least, I invest some time to do this, if I plan to take responsibility for those programs for a longer time and if I have to pass them through our normal change management and source archive systems.
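Bernd's save-time conversions are easy to state precisely. Here is a minimal Python sketch of the two most common ones; the function names are mine, not any particular editor's, and the 4-column default tab stop is an assumption.

```python
# Sketch of two common editor save-time normalizations: expanding tabs
# to the next tab stop, and converting line endings. Assumes uniform
# tab stops; editors with individual tab positions need a position list.

def expand_tabs(line: str, tabsize: int = 4) -> str:
    """Replace each tab with enough spaces to reach the next tab stop."""
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            pad = tabsize - (col % tabsize)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)

def normalize_eols(text: str, eol: str = "\n") -> str:
    """Convert CRLF or lone CR line endings to the chosen eol string."""
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", eol)
```

Trailing-blank removal and fixed-length padding are one-liners on top of these (`line.rstrip()`, `line.ljust(80)`).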
Kind regards, Bernd On 13.01.2014 00:28, Paul Gilmartin wrote: On Sun, 12 Jan 2014 15:48:49 -0600, Kirk Wolf wrote: On Linux gedit works fine, on Windows I use Notepad++ which handles Unix eols and UTF-8 You mean I don't have to wait for Windows 14!? Thanks! Does it do UNIX eols on input *and* output? Wordpad only does the former. Thanks again, gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
(Cross posting to ISPF-L and IBM-MAIN) On 2014-01-10, at 12:59, Don Poitras wrote: As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC. ... What representation does it use in the 3270 data streams? Is this well documented in the Data Streams reference? What must it do to avoid embedded 3270 command bytes? Is this compatible with Yale/7271/IND$FILE/Kermit conventions? As far as 3270 goes, I think it's just going to use the CODEPAGE and CHARSET you start ISPF with. I think it's going to be limited to the set of EBCDIC code pages. As this is the first release, I'm sure there's stuff missing that will be added as time goes by. I guess that conforms to someone's notion of support. Should I understand that one can edit UTF-8 files; one just can't see most of the characters. I guess any meaningful editing must be done with macros. (I don't yet have access to 2.1.) What happens if I turn HEX ON? Will it show the value of the Unicode code point, or of the UTF-8 sequence of bytes. Generally, neither can be represented in two hex digits. On 2014-01-10, at 16:19, Steve Comstock wrote: BTW, how can I convert majuscule-minuscule with ISPF EDIT. I know; I could write a macro ... Sheesh! Well, on a command line: c p'>' p'<' all Or, as a line command: LCC ... LCC should do it. Thanks. I hadn't known about that. So if in my UTF-8 file I have: == Polyglot == A common Russian phrase is ОЧЕНЬ ХОРОШО. The Greek might be ΠΟΛΥ ΚΑΛΑ. ... will those commands transform it to: == polyglot == a common russian phrase is очень хорошо. the greek might be πολυ καλα. ... even as Vim and LibreOffice do, and even if I can't see it? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
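For reference, here is what a fully Unicode-aware lowercase operation does with gil's sample text. This is Python's `str.lower()`, which applies the Unicode case mappings; whether ISPF EDIT's picture-string CHANGE does the same for Cyrillic and Greek on z/OS 2.1 is exactly the open question in this thread.

```python
# Unicode case mapping applied to the polyglot sample: Cyrillic and
# Greek capitals fold to their lowercase forms just as Latin ones do.
russian = "A common Russian phrase is ОЧЕНЬ ХОРОШО."
greek = "The Greek might be ΠΟΛΥ ΚΑΛΑ."
print(russian.lower())  # a common russian phrase is очень хорошо.
print(greek.lower())    # the greek might be πολυ καλα.
```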
Re: Subject Unicode
In article e488911a-b303-4d2f-8cf9-247154ab8...@aim.com you wrote: (Cross posting to ISPF-L and IBM-MAIN) On 2014-01-10, at 12:59, Don Poitras wrote: As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC. ... What representation does it use in the 3270 data streams? Is this well documented in the Data Streams reference? What must it do to avoid embedded 3270 command bytes? Is this compatible with Yale/7271/IND$FILE/Kermit conventions? As far as 3270 goes, I think it's just going to use the CODEPAGE and CHARSET you start ISPF with. I think it's going to be limited to the set of EBCDIC code pages. As this is the first release, I'm sure there's stuff missing that will be added as time goes by. I guess that conforms to someone's notion of support. Should I understand that one can edit UTF-8 files; one just can't see most of the characters. I guess any meaningful editing must be done with macros. (I don't yet have access to 2.1.) What happens if I turn HEX ON? Will it show the value of the Unicode code point, or of the UTF-8 sequence of bytes. Generally, neither can be represented in two hex digits. I don't know how these characters are going to survive email, so I'll describe what I did. Just editing all the hex from 00 to FF in EBCDIC mode, you end up with lots of glyphs that are two-byte in UTF-8. I copied one line using my emulator cut and paste and pasted the glyphs in a new member that I specified to be created using UTF-8. I then used the text split line command to put the first 5 glyphs each on a single line. The glyphs are: 1. logical not 2. pound (english money not weight) 3. Yen 4. Middle dot 5. Copyright I had to position the cursor on the correct hex byte to properly do the text-split. It's real easy to mess up the file. EDIT SASDTP.ISPF.CNTL(UTF8) - 01.02 Command === ** * Top of Data 01 ??] CACACACBCACACBCBCBCBC9CACA5CBC9222 2C2325272927262C2D2E3D282FD2437000 - 02 ?? CA 2C - 03 ?? CA 23 - 04 ?? CA 25 - 05 ??
CB 27 - 06 ?? CA 29 On 2014-01-10, at 16:19, Steve Comstock wrote: BTW, how can I convert majuscule-minuscule with ISPF EDIT. I know; I could write a macro ... Sheesh! Well, on a command line: c p'' p'' all Or, as a line command: LCC ... LCC should do it. Thanks. I hadn't known about that. So if my UTF-8 file I have: == Polyglot == A common Russian phrase is ? ??. The Greek might be . ... will those commands transform it to: == polyglot == a common russian phrase is ? ??. the greek might be . ... even as Vim and LibreOffice do, and even if I can't see it? -- gil -- Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive sas...@sas.com (919) 531-5637Cary, NC 27513 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Yeah, I didn't think that would work. :) If you're reading this as I am, all the (well most of) text below ended up as ??. In actuality, every ?? was a single width. The first line contains 16 characters with 32 hex bytes underneath. The subsequent lines are all a single character with 2 hex bytes shown. You can type in 3-byte UTF-8 codes, but they won't show anything in the text fields. I don't know how these characters are going to survive email, so I'll describe what I did. Just editing all the hex from 00 to FF in EBCDIC mode, you end up with lots of glyphs that are two-byte in UTF-8. I copied one line using my emulator cut and paste and pasted the glyphs in a new member that I specified to be created using UTF-8. I then used the text split line command to put the first 5 glyphs each on a single line. The glyphs are: 1. logical not 2. pound (english money not weight) 3. Yen 4. Middle dot 5. Copyright I had to position the cursor on the correct hex byte to properly do the text-split. It's real easy to mess up the file. EDIT SASDTP.ISPF.CNTL(UTF8) - 01.02 Command === ** * Top of Data 01 ??] CACACACBCACACBCBCBCBC9CACA5CBC9222 2C2325272927262C2D2E3D282FD2437000 - 02 ?? CA 2C - 03 ?? CA 23 - 04 ?? CA 25 - 05 ?? CB 27 - 06 ?? CA 29 -- Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive sas...@sas.com (919) 531-5637Cary, NC 27513 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
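Don's five glyphs really are two-byte UTF-8 sequences, and their encodings appear to match the vertical hex pairs in his listing (ISPF HEX ON shows the first hex digit of each byte on one row and the second below it, so bytes C2 AC display as CA over 2C). A short Python sketch to check:

```python
# The five glyphs from the text-split experiment and their UTF-8
# encodings. Each is a two-byte sequence; e.g. logical not (U+00AC)
# encodes as C2 AC, which matches the "CA / 2C" vertical hex columns.
glyphs = {"¬": "logical not", "£": "pound", "¥": "yen",
          "·": "middle dot", "©": "copyright"}
for ch, name in glyphs.items():
    utf8 = ch.encode("utf-8")
    print(f"{name:12} U+{ord(ch):04X} -> {utf8.hex(' ').upper()}")
```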
Re: Subject Unicode
Other than with a lot of inferential cleverness, there is no way to look at an ASCII-like file and tell what the code page is. The same applies to data encoded in EBCDIC. In fact, files are nothing but a series of bytes. You always need to know what those bytes represent in order to be able to work on them in a meaningful way. Especially in the distributed world, some conventions have been established that help programs in guessing what the file content might be. The first couple of bytes contain a certain byte sequence to identify the type of the file. But still, there is no guarantee the rest of the file matches that indication. Unfortunately, no such convention exists for pure text data. Neither a convention to indicate "this is text" nor one to tell the encoding / code page used. -- Peter Hunkeler -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
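The magic-number convention Peter describes can be sketched in a few lines. The signature table below lists a handful of well-known formats; real tools (file(1), libmagic) carry far larger tables, and as Peter says, a matching prefix is no guarantee about the rest of the file.

```python
# Minimal sketch of file-type sniffing by leading "magic" bytes.
# Text files have no such signature, which is Peter's point.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"GIF8", "GIF image"),
    (b"%PDF-", "PDF document"),
]

def sniff(data: bytes) -> str:
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown (could be anything, including text)"
```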
Re: Subject Unicode
You could use the BOM UTF characters to determine whether a file is UTF or not, and what form of UTF (UTF-8, UTF-16, UTF-32, big-endian or little-endian) is being used. The BOM characters are the UTF defined characters usually inserted transparently at the beginning of a UTF file. Granted this is not a perfect answer, but it may help for want of any other way to determine if a file is UTF or not. However, BOM characters are not always present; some platforms always have them (Microsoft) and some platforms eschew them. Windows Notepad is particularly tricky because it adds them without you realizing it. So whether you look at a file with Notepad (or other simple editor) or don't can both affect your results and cause you to question your sanity because you didn't realize this. BOM characters can be very useful. For example an XML header defines character encoding, but BOM characters can be used to determine the character encoding of the XML header itself. For UTF-8 the BOM character can be used to determine if a file is UTF encoded or not. But, for UTF-16 and UTF-32, it also allows you to determine the endianness of the UTF code units. Harry Date: Fri, 10 Jan 2014 08:01:42 + From: peter.hunke...@credit-suisse.com Subject: Re: Subject Unicode To: IBM-MAIN@LISTSERV.UA.EDU Other than with a lot of inferential cleverness, there is no way to look at an ASCII-like file and tell what the code page is. The same applies to data encoded in EBCDIC. In fact, files are nothing but a series of bytes. You always need to know what those bytes represent in order to be able to work on them in a meaningful way. Especially in the distributed world, some conventions have been established that help programs in guessing what the file content might be. The first couple of bytes contain a certain byte sequence to identify the type of the file. But still, there is no guarantee the rest of the file matches that indication. Unfortunately, no such convention exists for pure text data.
Neither a convention to indicate this is text nor to tell the encoding / code page used. -- Peter Hunkeler -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
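Harry's BOM sniffing is easy to make concrete. One subtlety worth showing: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so the longer patterns must be tested first. A sketch:

```python
# BOM detection sketch. Absence of a BOM proves nothing (many
# platforms omit it); presence is a strong, but not certain, hint.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32 BE"),
    (b"\xff\xfe\x00\x00", "UTF-32 LE"),  # must precede UTF-16 LE
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UTF-16 BE"),
    (b"\xff\xfe",         "UTF-16 LE"),
]

def detect_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```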
Re: Subject Unicode
John, if you are saying that there are some Unicode characters that cannot be represented in UTF-8 then that is incorrect. *Any* Unicode character -- pretty much any character in the world -- may be represented in UTF-8. For external representations of Unicode the battle is pretty much over and UTF-8 won. Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of John Gilmore Sent: Friday, January 10, 2014 6:51 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode I have refrained from saying anything about this topic because I judged that anything I said would be predictable. I am a well-known offender, a flagrant Unicode, i.e., minimally UTF-16, advocate. Now, however, Charles Mills has pushed me into posting something. He writes begin extract That is called UTF-16. Pretty good but still not very efficient. /end extract As usual, it depends. If one's problems are always with a single pair of natural languages, one of which is English (ENG or ENU), which makes little use of orthographically marked letters, a satisfactory UTF-8 'solution' may be, indeed usually is, possible. Something can, that is, be done in a UTF-8 framework with such language pairs as o English and French. o English and German, or even o English and Polish. As soon, however, as you need to support o three or more different roman-alphabet natural languages, or o a roman-alphabet language and a non-alphabetic Asian language you need UTF-16. To put the matter more brutally, any new system being built today and in particular any new system that is likely to interact, at whatever remove, with web-based systems should use UTF-16. The notion that the only efficient representation for character data is an SBCS one is retrograde at best. Continuing with it will make trouble for those who do so; worse, it will ensure that the systems they build are short-lived. The ASCII vs EBCDIC dispute is no longer of much interest.
They are both obsolescent, usable safely only in what the international lawyers call municipal contexts. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Fair enough. I was answering a question about French Unicode at five o'clock. I certainly don't mean to get hung up on efficiency and yes, for certain character distributions, UTF-16 yields a shorter file or message length than UTF-8. Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Timothy Sipples Sent: Thursday, January 09, 2014 11:31 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode Charles Mills writes: You could use 16 bits for every character, with some sort of cleverness that yielded two 16-bit words when you had a code point bigger than 65535 (actually somewhat less due to how the cleverness works). That is called UTF-16. Pretty good but still not very efficient. In Japan and China, to pick a couple examples, UTF-16 is rather efficient. There are also far worse inefficiencies than using 16 bits to store each Latin character. In short, I wouldn't get *too* hung up on this point, especially as the complete lifecycle costs of storage continue to fall. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
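The "cleverness" Charles mentioned is UTF-16's surrogate-pair mechanism, and his "actually somewhat less" is because the pair ranges D800-DFFF are carved out of the 16-bit space and cannot encode characters themselves. A sketch of the arithmetic:

```python
# UTF-16 surrogate pairs: a code point above U+FFFF is reduced by
# 0x10000, leaving 20 bits, which are split into two 10-bit halves
# carried in the reserved high (D800-DBFF) and low (DC00-DFFF) ranges.
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    assert cp > 0xFFFF, "BMP code points are encoded as a single unit"
    v = cp - 0x10000            # 20 bits remain
    high = 0xD800 + (v >> 10)   # top 10 bits
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits
    return high, low

# U+1F600 (an emoji) becomes the pair D83D DE00:
print([hex(u) for u in to_surrogate_pair(0x1F600)])
```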
Re: Subject Unicode
Gil: Co:Z SFTP and DatasetPipes both support any single-byte encoding as well as UTF-8 when converting to/from datasets. You can use either iconv or unicode system services, including custom tables and techniques. Scott: What is a foreign language Unicode page? Can you give a specific example? Kirk Wolf Dovetailed Technologies http://dovetail.com On Thu, Jan 9, 2014 at 6:47 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Thu, 9 Jan 2014 16:35:55 -0800, Scott Ford wrote: All: I have a fundamental question on Unicode, or more of how it works. I am confused about the following scenario: PC (data using a foreign language Unicode page, like French) going to z/OS and being kept intact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired? or how does it work? I believe, yes. What is the desired? iconv may be your friend here, either as a shell command or as a library subroutine, after transferring the file in BINARY. Will Co:Z let the user specify the target code page when transferring a file? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
historical reference 1960-1979 http://www.bobbemer.com/REGISTRY.HTM ibm major driver behind all this http://www.bobbemer.com/ZACHERLY.HTM however, Learson had problem and made decision to temporarily go with EBCDIC w/o realizing what he had done (The Biggest Computer Goof Ever) ... and the company got stuck with it http://www.bobbemer.com/P-BIT.HTM lots of other history http://www.bobbemer.com/HISTORY.HTM -- virtualization experience starting Jan1968, online at home since Mar1970 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Charles I do not think you read my post at all carefully. I made it clear that for specific language pairs UTF-8 is adequate if often clumsy. For multiple-language environments it is equally clear that it is inadequate. It is of course true that any grapheme, even say some company's logo or an astrological house, can be represented in UTF-8. The problem is not one of representability but of subset choice. The decision to include one may preclude the inclusion of another. Some subsets of at most 256 characters are adequate to some particular tasks and others are adequate to other particular tasks. None is adequate to all such tasks. Moreover, in my now considerable controversial experience I have noted that people who assert that 1) the real meaning of some word is what they want it to be or 2) that a battle is pretty much over and their side has won are arguing hopefully, trying to convince others, not recording the judgment of history. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
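The subset-choice problem John describes is real for 256-character single-byte code pages; whether it applies to UTF-8 is exactly what is in dispute. A sketch with Python's codecs shows the code-page version of it (the sample words are mine):

```python
# Subset choice in single-byte code pages: Latin-1 holds the German
# sample, Windows-1253 holds the Greek one, but no 256-character code
# page holds both. UTF-8 holds both in one encoding.
def fits(text: str, codepage: str) -> bool:
    """True if every character of text is representable in codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

german, greek = "Zürich", "Αθήνα"
print(fits(german, "latin-1"), fits(greek, "latin-1"))  # True False
print(fits(german, "cp1253"), fits(greek, "cp1253"))    # False True
print(fits(german + greek, "utf-8"))                    # True
```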
Re: Subject Unicode
On Fri, 10 Jan 2014 11:02:57 -0500, John Gilmore wrote: Charles I do not think you read my post at all carefully. I made it clear that for specific language pairs UTF-8 is adequate if often clumsy. For multiple-language environments it is equally clear that it is inadequate. It is of course true that any grapheme, even say some company's logo or an astrological house, can be represented in UTF-8. The problem is not one of representability but of subset choice. The decision to include one may preclude the inclusion of another. Some subsets of at most 256 characters are adequate to some particular tasks and others are adequate to other particular tasks. None is adequate to all such tasks. Do you accept that: o UTF-8 is a variable length encoding scheme? o UTF-8 has representations for all the million plus Unicode characters? o The UTF-8 representation of any character is invariant with respect to any choice of specific language [pairs]? Given these premises (which I accept) it does not occur that '[t]he decision to include one [grapheme] may preclude the inclusion of another.' There is no problem [...] of subset choice. -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
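Gil's three premises can be checked mechanically: UTF-8 is variable length (1-4 bytes per character), it covers every Unicode scalar value, and a character's encoding never depends on what other characters surround it. A sketch:

```python
# Checking gil's premises. The sample characters are one from each
# UTF-8 length class: ASCII, Latin-1 supplement, BMP, supplementary.
samples = ["A", "é", "€", "😀"]
print([len(c.encode("utf-8")) for c in samples])  # [1, 2, 3, 4]

# Context independence: 'é' encodes to the same bytes alone or
# embedded in any text, whatever the language mix.
assert "é".encode("utf-8") == "café".encode("utf-8")[3:]
```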
Re: Subject Unicode
Gil is 100% correct. And the assertion that the battle is over and UTF-8 has won is not my opinion. I don't have a dog in this fight. The world can go to 5-bit Baudot for all I care. It's simply a fact: http://w3techs.com/technologies/overview/character_encoding/all . Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Paul Gilmartin Sent: Friday, January 10, 2014 8:32 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode On Fri, 10 Jan 2014 11:02:57 -0500, John Gilmore wrote: Charles I do not think you read my post at all carefully. I made it clear that for specific language pairs UTF-8 is adequate if often clumsy. For multiple-language environments it is equally clear that it is inadequate. It is of course true that any grapheme, even say some company's logo or an astrological house, can be represented in UTF-8. The problem is not one of representability but of subset choice. The decision to include one may preclude the inclusion of another. Some subsets of at most 256 characters are adequate to some particular tasks and others are adequate to other particular tasks. None is adequate to all such tasks. Do you accept that: o UTF-8 is a variable length encoding scheme? o UTF-8 has representations for all the million plus Unicode characters? o The UTF-8 representation of any character is invariant with respect to any choice of specific language [pairs]? Given these premises (which I accept) it does not occur that '[t]he decision to include one [grapheme] may preclude the inclusion of another. There is no problem [...] of subset choice. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Cute. Notepad still exists in current Windows, btw. On Fri, Jan 10, 2014 at 9:41 AM, Paul Gilmartin paulgboul...@aim.com wrote: On Fri, 10 Jan 2014 09:36:32 -0500, Harry Wahl wrote: ... Windows Notepad is particularly tricky because it adds them without you realizing it. So whether you look at a file with Notepad (or other simple editor) or don't can both affect your results and cause you to question your sanity because you didn't realize this. Notepad? What's that? Perhaps some obsolete predecessor of Wordpad? -- gil -- zMan -- I've got a mainframe and I'm not afraid to use it -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On 1/10/2014 10:28 AM, zMan wrote: Cute. Notepad still exists in current Windows, btw. And it handles utf-8 fine. -Steve On Fri, Jan 10, 2014 at 9:41 AM, Paul Gilmartin paulgboul...@aim.com wrote: On Fri, 10 Jan 2014 09:36:32 -0500, Harry Wahl wrote: ... Windows Notepad is particularly tricky because it adds them without you realizing it. So whether you look at a file with Notepad (or other simple editor) or don't can both affect your results and cause you to question your sanity because you didn't realize this. Notepad? What's that? Perhaps some obsolete predecessor of Wordpad? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Paul, No, I do not accept the premises you set out. I will try, when I have more time, to make clear why with examples. Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist. Moreover, even when they are available, my experience with them has been bad. In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
You are mistaken. The rules for encoding a longer UTF-8 character are well-defined. http://en.wikipedia.org/wiki/UTF-8#Description Yes, it is a fact that for files with mostly Asian and similar characters UTF-8 is longer than UTF-16. Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of John Gilmore Sent: Friday, January 10, 2014 10:28 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode Paul, No, I do not accept the premises you set out. I will try, when I have more time, to make clear why with examples. Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist. Moreover, even when they are available, my experience with them has been bad. In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
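The rules Charles links to fit in a dozen lines. Here is a sketch of the encoder; for brevity it omits the validity checks a real encoder must add (rejecting surrogates D800-DFFF and values above U+10FFFF):

```python
# The UTF-8 encoding rules written out: each code point maps to exactly
# one 1-4 byte sequence, determined only by the code point's value.
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:                     # 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                    # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                  # 16 bits: 1110xxxx 10xxxxxx x2
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,    # 21 bits: 11110xxx 10xxxxxx x3
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

# Agrees with a library implementation for a sample from each plane:
for cp in (0x41, 0x3A9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```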
Re: Subject Unicode
On Jan 10, 2014, at 12:28 PM, John Gilmore jwgli...@gmail.com wrote: Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist. Sure they do. From http://www.unicode.org/faq/utf_bom.html#UTF8: "UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5, Encoding Forms, and Section 3.9, Unicode Encoding Forms, in the Unicode Standard." Also, at http://www.unicode.org/resources/utf8.html: ANSI C implementation of UTF-8 (http://www.bsdua.org/files/unicode.tar.gz) Converts UTF-8 into UCS4 and vice versa. Source code is BSD licensed. Moreover, even when they are available, my experience with them has been bad. In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version. As far as I've been able to see, the Unicode consortium views UTF-8 and UTF-16 as equally viable. Which is preferable depends entirely on the character of the texts you're processing. (Well, with UTF-16 you have to worry about endianness but with UTF-8 you don't.) If your text is mostly Latin and related characters, UTF-8 will probably be shorter. If it includes a significant amount of CJK (Chinese/Japanese/Korean) characters, as you apparently had here, UTF-16 will probably be shorter. -- Curtis Pew (c@its.utexas.edu) ITS Systems Core The University of Texas at Austin -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
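Curtis's point is easy to measure. A sketch comparing the two forms (utf-16-be is used so no BOM is counted; the sample strings are mine):

```python
# Which form is shorter depends on the script: Latin text is one byte
# per character in UTF-8 and two in UTF-16; most CJK characters are
# three bytes in UTF-8 and two in UTF-16.
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 length, UTF-16 length) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

print(sizes("The quick brown fox"))  # (19, 38) -- UTF-8 wins
print(sizes("こんにちは世界"))         # (21, 14) -- UTF-16 wins
```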
Re: Subject Unicode
In article 8790842028980392.wa.paulgboulderaim@listserv.ua.edu you wrote: On Thu, 9 Jan 2014 22:44:19 -0500, Don Poitras wrote: As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC. ... Does this support both UNIX and legacy files? If the latter, does it require RECFM=V? Using a variable-length character encoding in fixed length records seems pretty inconsistent. Yes. No. The same issue was true of DBCS which they've supported for years. I had a test case when I was converting CONDOR to use DBCS that caused a PROG error under ISPF, but I made sure CONDOR did something reasonable with it. How does it report invalid UTF-8 byte sequences? It doesn't. Does it still automatically switch to CAPS ON? What does it recognize as majuscule or minuscule with CAPS ON in Cyrillic characters, e.g.? Actually, it looks as though they have a bug with this. If I save a member with all caps, the next time I come in it says CAPS on was turned off because I have lower case characters. I don't have an emulator that will display or enter Cyrillic characters. I don't know which emulators will show all the other bazillion glyphs though... Indeed. Emulators? What about hardware for the emulators to emulate? Or does it require WSA? I don't know which hardware will display all the glyphs either. The only foreign 3270 I ever used was Korean. It had some funny keys and you had to type in several for each DBCS character. What representation does it use in the 3270 data streams? Is this well documented in the Data Streams reference? What must it do to avoid embedded 3270 command bytes? Is this compatible with Yale/7271/IND$FILE/Kermit conventions? As far as 3270 goes, I think it's just going to use the CODEPAGE and CHARSET you start ISPF with. I think it's going to be limited to the set of EBCDIC code pages. As this is the first release, I'm sure there's stuff missing that will be added as time goes by.
-- gil -- Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive sas...@sas.com (919) 531-5637 Cary, NC 27513 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
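For comparison with ISPF's "It doesn't": not every byte sequence is legal UTF-8, and a strict decoder must reject overlong forms, stray continuation bytes, and encoded surrogates. A sketch of what such reporting looks like where a strict decoder is available (z/OS facilities may well behave differently):

```python
# Validating UTF-8 with a strict decoder. The rejected examples are
# the classic ill-formed sequences from the UTF-8 specification.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"\xc2\xac"))      # True  (logical not)
print(is_valid_utf8(b"\xc0\xaf"))      # False (overlong encoding)
print(is_valid_utf8(b"\xa0"))          # False (lone continuation byte)
print(is_valid_utf8(b"\xed\xa0\x80"))  # False (encoded surrogate D800)
```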
Re: Subject Unicode
On 10 January 2014 13:28, John Gilmore jwgli...@gmail.com wrote: Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist. I am most puzzled to read this. UTF-8 is what Unicode calls a transform format, and the conversion from other encodings of Unicode characters is strictly (and simply) algorithmic, and by extension, unambiguous. (In the early Unicode discussions in the 1990s, some people whose native language was not English objected to the ambiguity and even intranslatability of the English phrase transform format, but despite that, the algorithmicity remains and is definitive.) Moreover, even when they are available, my experience with them has been bad. In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version. That I don't doubt at all. Whether UTF-8 is a good format for storage, transmission, or manipulation of Unicode characters surely varies by context. Tony H. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
I am familiar with Unicode. Wikipedia assertions of this or that about it do not persuade me of much of anything. Moreover, as a review of the archives will show, I am an advocate of its use. I have, however, found all of the UTF-8 implementations I have used both unsatisfactory and unreliable in the literal sense that conversions into UTF-8 from UTF-16 using them do not always yield the same results. If I have one, I suppose that English is my mother tongue; but, unlike some of you, my preoccupations are not exclusively or even predominantly anglophone. I am a polyglot. There is no effective appeal from my determination that a passage from Leopardi, say, is mangled when it is converted/moved from UTF-16 to UTF-8. I have of course reported these anomalies to the appropriate Unicode bodies. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
John, PMFJI here, but is it your position that because the *implementations* of Unicode character conversion routines (have been / are) flawed, that the *concept* of character conversions between UTF-16 and UTF-8 is useless? From my admittedly limited knowledge and research about the UTF-8 and UTF-16 character formats, ISTM that provably correct character-by-character conversion algorithms are and ought to be absolutely achievable. Not *language* conversion mind you, only *character* conversion. Language conversion is an entirely different kettle of fish. I won't argue that such character conversion algorithms currently exist, of course. I have not done sufficient research or experimentation to make that statement. Peter -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of John Gilmore Sent: Friday, January 10, 2014 4:10 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode I am familiar with Unicode. Wikipedia assertions of this or that about it do not persuade me of much of anything. Moreover, as a review of the archives will show, I am an advocate of its use. I have, however, found all of the UTF-8 implementations I have used both unsatisfactory and unreliable in the literal sense that conversions into UTF-8 from UTF-16 using them do not always yield the same results. If I have one, I suppose that English is my mother tongue; but, unlike some of you, my preoccupations are not exclusively or even predominantly anglophone. I am a polyglot. There is no effective appeal from my determination that a passage from Leopardi, say, is mangled when it is converted/moved from UTF-16 to UTF-8. I have of course reported these anomalies to the appropriate Unicode bodies. 
John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On Jan 10, 2014, at 3:10 PM, John Gilmore jwgli...@gmail.com wrote: I have, however, found all of the UTF-8 implementations I have used both unsatisfactory and unreliable in the literal sense that conversions into UTF-8 from UTF-16 using them do not always yield the same results. Is the issue related to surrogate pairs? This is in the FAQ I linked to in my previous email: Q: How do I convert a UTF-16 surrogate pair such as D800 DC00 to UTF-8? As one four-byte sequence or as two separate three-byte sequences? A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four-byte sequence. However, there is a widespread practice of generating pairs of three-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatibility Encoding Scheme for UTF-16: 8-bit (CESU-8) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it were UTF-8, due to the similarity of the formats. [AF] If I have one, I suppose that English is my mother tongue; but, unlike some of you, my preoccupations are not exclusively or even predominantly anglophone. I am a polyglot. There is no effective appeal from my determination that a passage from Leopardi, say, is mangled when it is converted/moved from UTF-16 to UTF-8. Then whatever converted it for you has a bug, because there is an isomorphic relationship between UTF-16 and UTF-8. I have of course reported these anomalies to the appropriate Unicode bodies. Perhaps you should report it to whoever created your conversion software. 
-- Curtis Pew (c@its.utexas.edu) ITS Systems Core The University of Texas at Austin -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
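The UTF-8 versus CESU-8 distinction Curtis quotes can be demonstrated directly. A small Python sketch showing both encodings of the surrogate pair D800 DC00 (the "surrogatepass" error handler is used only to reproduce the non-conformant CESU-8 form):

```python
# U+10000 is the first supplementary character; in UTF-16 it is the
# surrogate pair D800 DC00.
ch = "\U00010000"

# Conformant UTF-8: a single four-byte sequence.
assert ch.encode("utf-8") == b"\xf0\x90\x80\x80"

# CESU-8 style: each UTF-16 surrogate encoded separately as three bytes,
# six bytes in total. Python's "surrogatepass" handler reproduces it.
hi, lo = "\ud800", "\udc00"
cesu8 = hi.encode("utf-8", "surrogatepass") + lo.encode("utf-8", "surrogatepass")
assert cesu8 == b"\xed\xa0\x80\xed\xb0\x80" and len(cesu8) == 6
```

A converter emitting the six-byte form would explain results that differ between implementations while still being decodable by lenient readers.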
Re: Subject Unicode
I have not been able to identify a defect in the scheme specified for UTF-16 to UTF-8. I have pointed to implementations that are sometimes unsuccessful, and their failures have some common characteristics. For now, I avoid UTF-8 when I can. I expect that it will be problem-free at some not at all remote time in the future. I certainly was not prescient enough to think so ten years ago, but I now somewhat regret the availability of UTF-8. Its unsuitability for use with non-alphabetic text or with mixed 'alphabetic' and non-alphabetic text, like written Japanese, has produced a sharp difference in Eastern and Western Unicode usage patterns that is at best unfortunate. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On Fri, 10 Jan 2014 10:44:10 -0700, Steve Comstock wrote: On 1/10/2014 10:28 AM, zMan wrote: Cute. Notepad still exists in current Windows, btw. And it handles utf-8 fine. SIGH Notepad handles UTF-8 fine (on a scientific sample of 1). But it's utterly ignorant of UNIX line separators. Wordpad handles UNIX line separators on input, but not on output. I guess half is better than none. But it's utterly ignorant of UTF-8. /SIGH Vim on both Ubuntu Linux and OS X seems to be UTF-8 clever, even brilliant. In a document containing both Latin and Cyrillic text, the flip case command ('~') converts majuscule-minuscule for both, both ways. BTW, how can I convert majuscule-minuscule with ISPF EDIT? I know; I could write a macro ... Sheesh! -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On 1/10/2014 3:52 PM, Paul Gilmartin wrote: On Fri, 10 Jan 2014 10:44:10 -0700, Steve Comstock wrote: On 1/10/2014 10:28 AM, zMan wrote: Cute. Notepad still exists in current Windows, btw. And it handles utf-8 fine. SIGH Notepad handles UTF-8 fine (on a scientific sample of 1). But it's utterly ignorant of UNIX line separators. Wordpad handles UNIX line separators on input, but not on output. I guess half is better than none. But it's utterly ignorant of UTF-8. /SIGH Vim on both Ubuntu Linux and OS X seems to be UTF-8 clever, even brilliant. In a document containing both Latin and Cyrillic text, the flip case command ('~') converts majuscule-minuscule for both, both ways. BTW, how can I convert majuscule-minuscule with ISPF EDIT? I know; I could write a macro ... Sheesh! Well, on a command line: c p'>' p'<' all Or, as a line command: LCC . . . LCC should do it. -Steve -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Coming in Windows 14: WordNote, which will handle UTF-8 *and* UNIX line separators!!! On Fri, Jan 10, 2014 at 5:52 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Fri, 10 Jan 2014 10:44:10 -0700, Steve Comstock wrote: On 1/10/2014 10:28 AM, zMan wrote: Cute. Notepad still exists in current Windows, btw. And it handles utf-8 fine. SIGH Notepad handles UTF-8 fine (on a scientific sample of 1). But it's utterly ignorant of UNIX line separators. Wordpad handles UNIX line separators on input, but not on output. I guess half is better than none. But it's utterly ignorant of UTF-8. /SIGH Vim on both Ubuntu Linux and OS X seems to be UTF-8 clever, even brilliant. In a document containing both Latin and Cyrillic text, the flip case command ('~') converts majuscule-minuscule for both, both ways. BTW, how can I convert majuscule-minuscule with ISPF EDIT? I know; I could write a macro ... Sheesh! -- gil -- zMan -- I've got a mainframe and I'm not afraid to use it -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On Thu, 9 Jan 2014 16:35:55 -0800, Scott Ford wrote: All: I have a fundamental question on Unicode, or more of how it works. I am confused about the following scenario: PC ( data using a foreign language Unicode page, like French ) going to z/OS and being kept intact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? I believe, yes. What is the desired ? iconv may be your friend here, either as a shell command or as a library subroutine, after transferring the file in BINARY. Will Co:Z let the user specify the target code page when transferring a file? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Gil, We send a data message from a pc, we encrypt it with AES128, the message is received at the host (z/OS), decrypted, then converted from ASCII to EBCDIC. So I am trying to figure out how to determine what codepage the pc uses and have z/OS convert it to the proper EBCDIC codepage from ASCII. Does that help ? Scott ford www.identityforge.com from my IPAD On Jan 9, 2014, at 7:47 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Thu, 9 Jan 2014 16:35:55 -0800, Scott Ford wrote: All: I have a fundamental question on Unicode, or more of how it works. I am confused about the following scenario: PC ( data using a foreign language Unicode page, like French ) going to z/OS and being kept intact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? I believe, yes. What is the desired ? iconv may be your friend here, either as a shell command or as a library subroutine, after transferring the file in BINARY. Will Co:Z let the user specify the target code page when transferring a file? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
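Once the PC's code page is known, the decode-then-re-encode step Scott describes is mechanical. A sketch using Python's built-in codecs, assuming (purely for illustration) that the PC sends Windows-1252 and the host wants EBCDIC code page 037:

```python
# Hypothetical message bytes from a Windows PC; code page 1252 is an
# assumption for this example, not something detectable from the bytes.
pc_bytes = "Société Générale".encode("cp1252")

# Decode with the sender's code page, then re-encode to EBCDIC 037.
host_bytes = pc_bytes.decode("cp1252").encode("cp037")

# Every character here exists in both code pages, so nothing is lost.
assert host_bytes.decode("cp037") == "Société Générale"
```

The hard part, as the following replies point out, is knowing `cp1252` in the first place; the conversion itself is the easy half.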
Re: Subject Unicode
There is no such thing as French Unicode. That is the uni part and the beauty of Unicode. There are several flavors of Unicode, but they relate to how the code points are stored in a file or transmitted, not to the character set. All of Unicode is something like a million possible characters (someone will no doubt correct me with the exact number in use). Plain old ABC, French letters like ô, symbols like €, it's all there in one big Unicode. Every letter is always the same, whether you are in America or in France.

Now, how do you represent that in a file or whatever? Well, you could use 32 bits for every character. Not very efficient, but certainly straightforward. That is called UTF-32. It's not very common.

You could use 16 bits for every character, with some sort of cleverness that yielded two 16-bit words when you had a code point bigger than 65535 (actually somewhat less due to how the cleverness works). That is called UTF-16. Pretty good but still not very efficient.

You could use 8 bits for most characters, with cleverness that expanded that out to two or three bytes for more obscure characters. Pretty efficient, and you could make the first part of the character set the same as ASCII, which would make it intuitive for PC folks who know that A is X'41'. That is called UTF-8, and it's pretty good and pretty popular as a result. Most Web pages are in UTF-8 and I believe this e-mail came to you in UTF-8.

Okay? Now, define keep it intact. Do you mean bit for bit intact, or do you mean so that when I open it up in ISPF, what looked like an A on the PC now looks like an A in ISPF? If the former, you want a binary transfer, end of story. If the latter, you don't really want to keep it intact, you want to translate Unicode -- and you will need to know which flavor of Unicode encoding (not what country) -- to EBCDIC, which is what ISPF and most COBOL programs expect. Comprende? 
Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Scott Ford Sent: Thursday, January 09, 2014 4:36 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Subject Unicode All: I have a fundamental question on Unicode, or more of how it works . I am confused about the following scenario.. PC ( data using a foreign language Unicode page, like French ) going to z/OS and being keep in tact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
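Charles's three flavors are easy to compare empirically. A rough sketch of the storage cost of each encoding for a few sample strings (byte counts; the sample texts are just illustrations):

```python
# Byte cost of UTF-32, UTF-16, and UTF-8 for the same text.
samples = {
    "English":  "Hello, world",
    "French":   "Hôtel à côté",
    "Japanese": "日本語のテキスト",
}
for name, s in samples.items():
    print(f"{name:8}"
          f"  UTF-32: {len(s.encode('utf-32-le')):3}"   # 4 bytes per code point
          f"  UTF-16: {len(s.encode('utf-16-le')):3}"   # 2 (4 for supplementary)
          f"  UTF-8:  {len(s.encode('utf-8')):3}")      # 1-4; ASCII stays 1
```

For Latin-heavy text UTF-8 wins; for CJK text UTF-16 is tighter, which is the usage split later posts in this thread argue about.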
Re: Subject Unicode
Scott - The PC is going to have to provide the codepage of the message data someplace in the communication protocol. Either as a separate field, separate message or as a prefix/suffix to the message data. It will be pretty dicey to attempt to guess the codepage based on the message data. One other possibility would be to provide a configuration file to the z/OS side which says what codepage the PC is using. Then the PC would need to actually use the agreed upon codepage. Sam On Thu, Jan 9, 2014 at 5:39 PM, Scott Ford scott_j_f...@yahoo.com wrote: Gil, We send a data message from a pc, we encrypt it with AES128 , the message is received at the host (z/OS) decrypted then converted from ascii to ebcdic..so I am trying to figure out how to Determine what codepage the pc uses and have z/OS convert it to the proper EBCDIC codepage from ASCII. Does that help ? Scott ford www.identityforge.com from my IPAD On Jan 9, 2014, at 7:47 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Thu, 9 Jan 2014 16:35:55 -0800, Scott Ford wrote: All: I have a fundamental question on Unicode, or more of how it works . I am confused about the following scenario.. PC ( data using a foreign language Unicode page, like French ) going to z/OS and being keep in tact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? I believe, yes. What is the desired ? iconv may be your friend here, either as a shell command or as a library subroutine, after transferring the file in BINARY. Will Co:Z let the user specify the target code page when transferring a file? 
-- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Wait. Unicode, or some ASCII variant like, say, a French 7-bit PC code page? Other than with a lot of inferential cleverness, there is no way to look at an ASCII-like file and tell what the code page is. Think about it. The whole problem is that you use X'9B' on your PC to mean ¢ and a Frenchman uses it to mean something else. Your program sees an X'9B' in the file. What does it mean? Someone is going to have to tell you the original code page. Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Scott Ford Sent: Thursday, January 09, 2014 5:39 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode Gil, We send a data message from a pc, we encrypt it with AES128 , the message is received at the host (z/OS) decrypted then converted from ascii to ebcdic..so I am trying to figure out how to Determine what codepage the pc uses and have z/OS convert it to the proper EBCDIC codepage from ASCII. Does that help ? -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
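Charles's X'9B' example can be shown concretely: the same byte decodes to different characters under different single-byte code pages, and nothing in the byte itself says which reading is right. A short Python sketch:

```python
# One byte, two meanings: code page ambiguity in single-byte encodings.
b = b"\x9b"
print(b.decode("cp437"))   # U.S. PC code page 437: the cent sign, ¢
print(b.decode("cp850"))   # Western European DOS code page 850: ø
```

This is why the sender's code page must travel with the data (or be agreed out of band); it cannot be recovered from the bytes alone.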
Re: Subject Unicode
On 9 January 2014 20:39, Scott Ford scott_j_f...@yahoo.com wrote: We send a data message from a pc, we encrypt it with AES128 , the message is received at the host (z/OS) decrypted then converted from ascii to ebcdic..so I am trying to figure out how to Determine what codepage the pc uses and have z/OS convert it to the proper EBCDIC codepage from ASCII. Does that help ? I'm not sure how your question relates to UNICODE. If the data on the PC (Windows, I assume) is in some encoding of UNICODE, then code pages don't really come into play. Any version of Windows can (in theory) use any UNICODE character, regardless of the country or language of installation. So there will be no difference between the way a US English Windows box encodes, say the dollar sign character and the way a French French one does. The UNICODE code point for dollar sign is U+0024, and that's that. But you also mention ASCII, which (loosely) is an 8-bit encoding. (Usually ASCII these days really means some single-byte code page such as ISO 8859-n or one of the Windows ones such as 1252.) There is no general way to convert UNICODE into EBCDIC, because no IBM EBCDIC code page encodes all UNICODE characters. And if you are talking about single-byte EBCDIC code pages such as 037 or 1047, IBM's generally encode 192 characters, vs tens of thousands in UNICODE. If your PC data is in ASCII, i.e. single byte encoding, then you have to both determine the code page in use on the PC, and the one in use on your z/OS, and then use the appropriate mapping. Such a mapping may not exist. For example, if your PC is using a Polish code page, and your z/OS a Western European one such as 1047, there are characters in each that just aren't in the other. Something will break - generally someone's name will be misspelled, or worse. Maybe you can give a short example of what data you have at each end, and what you want to happen to it. Tony H. 
-- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC. I don't know which emulators will show all the other bazillion glyphs though... In article 022c01cf0da5$a7b25180$f716f480$@mcn.org you wrote: There is no such thing as French Unicode. That is the uni part and the beauty of Unicode. There are several flavors of Unicode, but they relate to how the code points are stored in a file or transmitted, not to the character set. All of Unicode is something like a million possible characters (someone will no doubt correct me with the exact number in use). Plain old ABC, French letters like ô, symbols like €, it's all there in one big Unicode. Every letter is always the same, whether you are in America or in France. Now, how do you represent that in a file or whatever? Well, you could use 32 bits for every character. Not very efficient, but certainly straightforward. That is called UTF-32. It's not very common. You could use 16 bits for every character, with some sort of cleverness that yielded two 16-bit words when you had a code point bigger than 65535 (actually somewhat less due to how the cleverness works). That is called UTF-16. Pretty good but still not very efficient. You could use 8 bits for most characters, with cleverness that expanded that out to two or three bytes for more obscure characters. Pretty efficient, and you could make the first part of the character set the same as ASCII, which would make it intuitive for PC folks who know that A is X'41'. That is called UTF-8, and it's pretty good and pretty popular as a result. Most Web pages are in UTF-8 and I believe this e-mail came to you in UTF-8. Okay? Now, define keep it intact. Do you mean bit for bit intact, or do you mean so that when I open it up in ISPF, what looked like an A on the PC now looks like an A in ISPF? If the former, you want a binary transfer, end of story. 
If the latter, you don't really want to keep it intact, you want to translate Unicode -- and you will need to know which flavor of Unicode encoding (not what country) -- to EBCDIC, which is what ISPF and most COBOL programs expect. Comprende? Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Scott Ford Sent: Thursday, January 09, 2014 4:36 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Subject Unicode All: I have a fundamental question on Unicode, or more of how it works. I am confused about the following scenario: PC ( data using a foreign language Unicode page, like French ) going to z/OS and being kept intact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? -- Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive sas...@sas.com (919) 531-5637 Cary, NC 27513 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Charles Mills writes: You could use 16 bits for every character, with some sort of cleverness that yielded two 16-bit words when you had a code point bigger than 65535 (actually somewhat less due to how the cleverness works). That is called UTF-16. Pretty good but still not very efficient. In Japan and China, to pick a couple examples, UTF-16 is rather efficient. There are also far worse inefficiencies than using 16 bits to store each Latin character. In short, I wouldn't get *too* hung up on this point, especially as the complete lifecycle costs of storage continue to fall. For example, if you're designing applications and information systems for a global audience (or potentially global audience), it could be a perfectly reasonable decision to standardize on UTF-16 in favor of potential reductions in testing (for example). I think this is exactly what SAP did around the time they introduced their ECC releases, for instance. Somehow I'm reminded of the save two characters impulse which then caused a lot of angst in preparing for Y2K. :-) If there's a reasonable argument for spending 16 bits -- and sometimes there is -- by all means, spend them. This isn't 1974 or even 1994. The vast majority of the world's data are not codepoint-encoded alphanumerics anyway. Timothy Sipples GMU VCT Architect Executive (Based in Singapore) E-Mail: sipp...@sg.ibm.com -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
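Sipples's efficiency point is easy to quantify for BMP CJK text, where UTF-16 needs 2 bytes per character and UTF-8 needs 3. A minimal Python check (the sentence is just an illustration):

```python
# For CJK text in the BMP, UTF-16 is the more compact encoding:
# 2 bytes per character versus 3 in UTF-8.
jp = "東京は日本の首都です"          # 10 kana/kanji characters
assert len(jp.encode("utf-16-le")) == 20
assert len(jp.encode("utf-8")) == 30
```

That 50% overhead is roughly the magnitude John Gilmore reported earlier in the thread for his mixed-language document.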