Re: Subject Unicode
In CAArMM9T5iAWomwY=mpt5lazdbz7xaz0h6b0nhyjws0ymc0o...@mail.gmail.com, on 01/13/2014 at 02:27 PM, Tony Harminc t...@harminc.net said:

> But no one would say that UTF-8 *is* ASCII, or that UTF-EBCDIC *is* EBCDIC.

Well, all ASCII characters are valid single-octet UTF-8 sequences, so I would say that ASCII is a subset of UTF-8. As for EBCDIC, there were already multiple EBCDIC code pages prior to Unicode, so there would seem to be a case for calling UTF-EBCDIC as much EBCDIC as the others. Does the IBM documentation take a position on that?

--
Shmuel (Seymour J.) Metz, SysProg and JOAT
ISO position; see http://patriot.net/~shmuel/resume/brief.html
We don't care. We don't have to care, we're Congress. (S877: The Shut up and Eat Your spam act of 2003)

--
For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
There *are* general ways to convert Unicode into EBCDIC. IBM z/OS Unicode Services implements several of them. Yes, a Unicode file potentially (but not necessarily) includes characters not found in a particular EBCDIC code page. Traditionally, they are converted to EBCDIC SUB, X'3F'. Assuming you refer to SBCS EBCDIC, the conversion results are likely to be unsatisfying if the Unicode file is, as is likely, rich in characters with no EBCDIC equivalent. OTOH EBCDIC DBCS includes a very large subset of common Unicode characters.

Charles

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Tony Harminc
Sent: Monday, January 13, 2014 2:27 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Subject Unicode

On 12 January 2014 10:21, Shmuel Metz (Seymour J.) shmuel+ibm-m...@patriot.net wrote:
> on 01/09/2014 at 09:00 PM, Tony Harminc t...@harminc.net said:
>> There is no general way to convert UNICODE into EBCDIC,
> There are EBCDIC transforms for Unicode. I'm not sure whether that qualifies as EBCDIC.

Exactly as much as UTF-8 qualifies as ASCII, that is to say, not at all. In both cases (UTF-8 and UTF-EBCDIC), there are several characteristics of the encoded result that are convenient in the respective environments. In particular, for legacy applications, the most often used characters in single-byte ASCII/EBCDIC are encoded by the same byte value in UTF-xxx. But no one would say that UTF-8 *is* ASCII, or that UTF-EBCDIC *is* EBCDIC.

Tony H.
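The SUB-substitution behavior Charles describes can be sketched in Python (a sketch only, not z/OS Unicode Services; the `ebcdic-sub` handler name is made up here), using Python's cp037 codec as a stand-in EBCDIC code page:

```python
import codecs

def ebcdic_sub(err):
    # Replace each unmappable character with SUB (U+001A),
    # which cp037 encodes as the EBCDIC SUB byte X'3F'.
    return ('\x1a', err.end)

codecs.register_error('ebcdic-sub', ebcdic_sub)  # handler name is arbitrary

text = 'Invoice total: €100'   # the Euro sign has no cp037 code point
data = text.encode('cp037', errors='ebcdic-sub')
assert b'\x3f' in data         # the Euro became EBCDIC SUB
```

For the richer repertoires mentioned above, the same handler works with any EBCDIC codec Python ships (cp500, cp1140, etc.).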
Re: Subject Unicode (Also email. Also TAB.)
In 7503442349556875.wa.paulgboulderaim@listserv.ua.edu, on 01/12/2014 at 09:55 AM, Paul Gilmartin paulgboul...@aim.com said:

> Thereby sacrificing some small economy of storage. There are even better arguments for deferring the disambiguation, such as:
> o Use of tabs as field separators in exported data bases.
> o Rendering in proportional-spaced fonts, particularly when the choice of font is left to the viewer.

o Use of HT to represent HT for applications that treat HT as an HT, e.g., EDIT, SCRIPT.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode (Also email. Also TAB.)
In 52d2d540.1020...@t-online.de, on 01/12/2014 at 06:47 PM, Bernd Oppolzer bernd.oppol...@t-online.de said:

> IMO, the idea to put tab characters into files is wrong from the beginning.

I don't agree; it's useful for text markup. I don't like taking away a printable character as a logical tab.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In 20140111220658.62ce18f...@panix3.panix.com, on 01/11/2014 at 05:06 PM, Don Poitras poit...@pobox.com said:

> I don't know how these characters are going to survive email,

Not without proper[1] MIME header fields; characters like, e.g., Copyright (©), Euro (€), Registered (®), Yen (¥), are not ASCII.

[1] E.g.,
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 8bit

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
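The headers in the footnote can be produced from Python's standard library; a minimal sketch (the message text is invented for illustration):

```python
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Unicode survives with proper MIME headers'
# Declaring the charset is what lets non-ASCII characters like the
# Euro sign travel intact; the library fills in Content-Type and a
# suitable Content-Transfer-Encoding.
msg.set_content('Price: €10 (also © ® ¥)', charset='iso-8859-15')

print(msg['Content-Type'])   # e.g. text/plain; charset="iso-8859-15"
```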
Re: Subject Unicode
In 8160871980876269.wa.paulgboulderaim@listserv.ua.edu, on 01/12/2014 at 03:28 PM, Paul Gilmartin paulgboul...@aim.com said:

> Doesn't understand UNIX line breaks.

I don't FTP text files as binary. NOTEPAD doesn't introduce fancy formatting that I didn't request and don't want. For me, that makes it superior to WORDPAD.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In CAE1XxDF7qr2ek3mdCFRsgdqUjpReOCmCs5qqfckwMY7sh=t...@mail.gmail.com, on 01/12/2014 at 05:11 PM, John Gilmore jwgli...@gmail.com said:

> If I argued that the comments prefixed to a routine described its putative algorithm correctly and that the routine itself could thus contain no error, Shmuel would still hopefully be quick to point out the inadequacy of my argument; but here he is guilty of the same sort of cocksure silliness.

Nonsense; you are conflating a formal specification with a body of code purporting to implement that specification. If your real complaint is that there is code in the wild that does not correctly implement the specifications, then be honest enough to say so instead of playing word games.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
On Sun, Jan 12, 2014 at 5:28 PM, Paul Gilmartin paulgboul...@aim.com wrote:
> On Sun, 12 Jan 2014 15:48:49 -0600, Kirk Wolf wrote:
>> On Linux gedit works fine, on Windows I use Notepad++ which handles Unix eols and UTF-8
> You mean I don't have to wait for Windows 14!? Thanks! Does it do UNIX eols on input *and* output? Wordpad only does the former.

Yes. You can switch in the current document, or you can set the default for new documents.
Re: Subject Unicode (Also email. Also TAB.)
On Sun, 12 Jan 2014 13:09:40 -0500, Shmuel Metz (Seymour J.) wrote:
>> Thereby sacrificing some small economy of storage. There are even better arguments for deferring the disambiguation, such as:
>> o Use of tabs as field separators in exported data bases.
>> o Rendering in proportional-spaced fonts, particularly when the choice of font is left to the viewer.
> o Use of HT to represent HT for applications that treat HT as an HT, e.g., EDIT, SCRIPT.

A considerable refutation of the argument against retaining tabs in files.

On Mon, 13 Jan 2014 07:51:33 -0500, Shmuel Metz (Seymour J.) wrote:
>> [Notepad] Doesn't understand UNIX line breaks.
> I don't FTP text files as binary. NOTEPAD doesn't introduce fancy formatting that I didn't request and don't want. For me, that makes it superior to WORDPAD.

Rather than FTPing hither and yon, I share many of my files with NFS and Samba among UNIX, z/OS, and Windows. This argues for an eclectic editor.

-- gil
Re: Subject Unicode
On 12 January 2014 10:21, Shmuel Metz (Seymour J.) shmuel+ibm-m...@patriot.net wrote:
> on 01/09/2014 at 09:00 PM, Tony Harminc t...@harminc.net said:
>> There is no general way to convert UNICODE into EBCDIC,
> There are EBCDIC transforms for Unicode. I'm not sure whether that qualifies as EBCDIC.

Exactly as much as UTF-8 qualifies as ASCII, that is to say, not at all. In both cases (UTF-8 and UTF-EBCDIC), there are several characteristics of the encoded result that are convenient in the respective environments. In particular, for legacy applications, the most often used characters in single-byte ASCII/EBCDIC are encoded by the same byte value in UTF-xxx. But no one would say that UTF-8 *is* ASCII, or that UTF-EBCDIC *is* EBCDIC.

Tony H.
Re: Subject Unicode
On Mon, Jan 13, 2014 at 1:27 PM, Tony Harminc t...@harminc.net wrote:
> On 12 January 2014 10:21, Shmuel Metz (Seymour J.) shmuel+ibm-m...@patriot.net wrote:
>> on 01/09/2014 at 09:00 PM, Tony Harminc t...@harminc.net said:
>>> There is no general way to convert UNICODE into EBCDIC,
>> There are EBCDIC transforms for Unicode. I'm not sure whether that qualifies as EBCDIC.
> Exactly as much as UTF-8 qualifies as ASCII, that is to say, not at all. In both cases (UTF-8 and UTF-EBCDIC), there are several characteristics of the encoded result that are convenient in the respective environments. In particular, for legacy applications, the most often used characters in single-byte ASCII/EBCDIC are encoded by the same byte value in UTF-xxx. But no one would say that UTF-8 *is* ASCII, or that UTF-EBCDIC *is* EBCDIC.

As a former US president famously said, it depends on what the meaning of the word 'is' is :-) It would be perfectly reasonable to say that UTF-8 is a superset of ASCII. That was its design: the lower 128 code points are ASCII (7-bit).

Kirk Wolf
Dovetailed Technologies
http://dovetail.com
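The superset claim is easy to check mechanically; every 7-bit ASCII code point encodes to the identical single byte in UTF-8, while anything beyond ASCII takes more than one byte. A quick sketch:

```python
# Every ASCII code point (0-127) is the same single byte in UTF-8 ...
for i in range(128):
    assert chr(i).encode('utf-8') == bytes([i])

# ... while anything beyond ASCII needs multiple bytes.
assert len('é'.encode('utf-8')) == 2
assert len('€'.encode('utf-8')) == 3
```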
Re: Subject Unicode
It might survive as a .txt attachment. Everything else gets sliced and diced.

In a message dated 1/11/2014 4:15:43 P.M. Central Standard Time, poit...@pobox.com writes:
> Yeah, I didn't think that would work. :) If you're reading this as I am, all (well, most of) the text below ended up as ??.
Re: Subject Unicode (Also email. Also TAB.)
On Sun, 12 Jan 2014 03:48:45 -0500, Ed Finnell wrote:
> It might survive as .txt attachment. Everything else gets sliced and diced.

Depends on the MUA. The text I submitted earlier by email:

== Polyglot ==
A common Russian phrase is ОЧЕНЬ ХОРОШО.
The Greek might be ΠΟΛΥ ΚΑΛΑ.
...

... made the round trip intact. I believe it's also preserved by the web interface. We'll see now.

On Sun, 12 Jan 2014 15:53:16 +0800, Timothy Sipples wrote:
> There's a tab symbol glyph at Unicode point U+21E5. It's a glyph consisting of a rightwards arrow to a bar. Many keyboards with a Tab key include this symbol as part of the key label. More information here: https://en.wikipedia.org/wiki/Arrow_(symbol)

That might be RIGHTWARDS ARROW TO BAR ⇥.

-- gil
Re: Subject Unicode (Also email. Also TAB.)
On the several keyboards I have at hand tab is modal, right or left depending upon the current shift-key setting. The modal marking appears to be

| tab |
| ——  |
| ——  |

in which the 'arrowheads' are solid, not open. I should think that '|' would be adequately perspicuous.

The notorious ambiguity of tabs remains. Their effects depend upon local tab settings, and many implementations disambiguate them by replacing them with blanks of currently equivalent effect in saved/stored files.

John Gilmore, Ashland, MA 01721 - USA
Re: Subject Unicode (Also email. Also TAB.)
On Sun, 12 Jan 2014 10:29:22 -0500, John Gilmore wrote:
> ... [Tabs'] effects depend upon local tab settings, and many implementations disambiguate them by replacing them with blanks of currently equivalent effect in saved/stored files.

Thereby sacrificing some small economy of storage. There are even better arguments for deferring the disambiguation, such as:
o Use of tabs as field separators in exported data bases.
o Rendering in proportional-spaced fonts, particularly when the choice of font is left to the viewer.

-- gil
Re: Subject Unicode (Also email. Also TAB.)
IMO, the idea to put tab characters into files is wrong from the beginning. But of course it comes from the paper tape paradigm, where a file is historically a paper tape feeding a teletype machine. With normal local typewriters, a tab is nothing other than a command to the typewriter to move the carriage to a certain position, and that's how it should be implemented in more record-oriented environments.

That said, I would like it if the editors I use replaced all tabs with blanks when storing the files, so that there are never any tab characters inside the files, because when reading them you have the problem of deciding what tab positions the file is meant to have; you always have to guess, it's wrong most of the time, and the result looks awful.

Regards tabs separating fields in external representations of database records: there are other possibilities. Commas and semicolons are not nice, but they work, given proper handling of the text fields (and: if the text fields really contain text and no binary data).

In total: I hate tabs and try to avoid them wherever I can.

Kind regards
Bernd

Am 12.01.2014 16:55, schrieb Paul Gilmartin:
> On Sun, 12 Jan 2014 10:29:22 -0500, John Gilmore wrote:
>> ... [Tabs'] effects depend upon local tab settings, and many implementations disambiguate them by replacing them with blanks of currently equivalent effect in saved/stored files.
> Thereby sacrificing some small economy of storage. There are even better arguments for deferring the disambiguation, such as:
> o Use of tabs as field separators in exported data bases.
> o Rendering in proportional-spaced fonts, particularly when the choice of font is left to the viewer.
>
> -- gil
Re: Subject Unicode
In 1389314155.47172.yahoomail...@web126205.mail.ne1.yahoo.com, on 01/09/2014 at 04:35 PM, Scott Ford scott_j_f...@yahoo.com said:

> PC ( data using a foreign language Unicode page

What are you trying to say? If the PC is using Unicode then it will transmit data as UTF-7 or UTF-8, which covers the entire BMP and beyond. Are you really asking about translating between ISO-8859 code pages and Unicode?

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In 022c01cf0da5$a7b25180$f716f480$@mcn.org, on 01/09/2014 at 05:45 PM, Charles Mills charl...@mcn.org said:

> There are several flavors of Unicode, but they relate to how the code points are stored in a file or transmitted, not to the character set.

Actually, those are transforms rather than different flavors of Unicode. Unicode does come in distinct numbered versions, but AFAIK a code point defined in an older version will always be present in the more recent versions.

> (someone will no doubt correct me with the exact number in use)

That would be a moving target; Unicode does not currently assign all code points in the BMP, much less the full 21-bit range.

> and you could make the first part of the character set the same as ASCII, which would make it intuitive for PC folks who know that A is X'41'. That is called UTF-8,

UTF-8 uses non-ASCII byte values to represent code points higher than 127; UTF-7 uses only ASCII characters. I hope he's not using UTF-7.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In 0e75a300-f7c5-46a7-a1d3-7189d2a58...@yahoo.com, on 01/09/2014 at 08:39 PM, Scott Ford scott_j_f...@yahoo.com said:

> We send a data message from a pc, we encrypt it with AES128, the message is received at the host (z/OS), decrypted, then converted from ascii to ebcdic

If it really was ASCII then it would be cut and dried. If it's anything else then you need to know what it is in order to convert.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In 20140110034419.c71008f...@panix3.panix.com, on 01/09/2014 at 10:44 PM, Don Poitras poit...@pobox.com said:

> As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC.

Only if the PC was using UTF-8 or translates to Unicode with UTF-8 as part of the transmission.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In CAArMM9QOFq1jtzwmj=LWHTKkadMzx=aaqppbbqjnk+c8kuz...@mail.gmail.com, on 01/09/2014 at 09:00 PM, Tony Harminc t...@harminc.net said:

> There is no general way to convert UNICODE into EBCDIC,

There are EBCDIC transforms for Unicode. I'm not sure whether that qualifies as EBCDIC.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In of065337e4.0e9ce2ec-on48257c5c.0027bfa8-48257c5c.00297...@sg.ibm.com, on 01/10/2014 at 03:30 PM, Timothy Sipples sipp...@sg.ibm.com said:

> Somehow I'm reminded of the save two characters impulse which then caused a lot of angst in preparing for Y2K.

The situations are not comparable. With 2-digit years there was an actual truncation of the data. With UTF-7 or UTF-8, all of the data are still present, but there is an efficiency issue.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In bay177-w36d24c9a61f3b6464b4992f2...@phx.gbl, on 01/10/2014 at 09:36 AM, Harry Wahl harry_w...@hotmail.com said:

> You could use the BOM UTF characters

There are none. U+FEFF ZERO WIDTH NO-BREAK SPACE is a Unicode character.

> usually inserted transparently at the beginning of a UTF file.

Usually inserted *only* at the beginning of a file transmitted as UCS-2, UCS-4, UTF-16 or UTF-32. Also, see the restrictions in RFC 3629 (STD 63), UTF-8, a transformation format of ISO 10646, section 6, Byte order mark (BOM).

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
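The distinction drawn here (BOM meaningful for UTF-16/32, discouraged for UTF-8) can be observed from Python's codecs; a small sketch:

```python
import codecs

# UTF-16 with unspecified byte order gets a BOM prepended automatically.
data = 'A'.encode('utf-16')
assert data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# With an explicit byte order there is no BOM to write.
assert 'A'.encode('utf-16-le') == b'A\x00'

# In UTF-8 a "BOM" is just the ordinary encoding of U+FEFF
# (ZERO WIDTH NO-BREAK SPACE); RFC 3629 section 6 discourages it,
# since UTF-8 has no byte-order ambiguity to resolve.
assert '\ufeff'.encode('utf-8') == codecs.BOM_UTF8 == b'\xef\xbb\xbf'
```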
Re: Subject Unicode
In 9931357931112854.wa.paulgboulderaim@listserv.ua.edu, on 01/10/2014 at 08:41 AM, Paul Gilmartin paulgboul...@aim.com said:

> Notepad? What's that? Perhaps some obsolete predecessor of Wordpad?

No, it's a superior version of wordpad. HTH.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In CAE1XxDHn0wgwJpm+cNLSdzv=ccvoz1u5o6em7xwxnqs4u0z...@mail.gmail.com, on 01/10/2014 at 09:50 AM, John Gilmore jwgli...@gmail.com said:

> As soon, however, as you need to support
> o three or more different roman-alphabet natural languages, or
> o a roman-alphabet language and a non-alphabetic Asian language
> you need UTF-16.

Nonsense; it's strictly an efficiency issue, and depends on the relative frequencies with which you use various characters and the degree to which you do locale-dependent functions.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode
In CAE1XxDHBJRiuFH7N-031xdJ3DUnO6QyG-=otde8fvnc-uyv...@mail.gmail.com, on 01/10/2014 at 11:02 AM, John Gilmore jwgli...@gmail.com said:

> The problem is not one of representability but of subset choice.

There is no problem of subset choice, because use of UTF-8 does not imply a proper subset of Unicode; it is a transform for every code point from U+0000 through U+10FFFF.[1]

[1] Except U+D800 through U+DFFF, which are not valid.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
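The footnote's exclusion is enforced by conforming encoders; a quick Python check (sketch):

```python
# UTF-8 spans the full range U+0000..U+10FFFF ...
assert '\U0010FFFF'.encode('utf-8') == b'\xf4\x8f\xbf\xbf'

# ... except the surrogate range U+D800..U+DFFF, which RFC 3629 forbids;
# a conforming encoder must reject a lone surrogate.
try:
    '\ud800'.encode('utf-8')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```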
Re: Subject Unicode
In cae1xxdgwscr+ffp13_rperg4jmkferdgp4f6sxtz7v48o4g...@mail.gmail.com, on 01/10/2014 at 01:28 PM, John Gilmore jwgli...@gmail.com said:

> Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist.

What are you drinking? RFC 3629 spells them out in excruciating detail.

> In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version.

That is simply an efficiency issue; "you need UTF-16" is a much stronger claim than "UTF-16 may be more efficient". Further, a sample size of one is grossly inadequate for drawing statistical conclusions. Try documents that are mostly English, French and German with a smattering of CJK languages and you will get different results.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
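The relative-frequency point is easy to demonstrate: which transform is smaller depends entirely on the script mix. A sketch with made-up sample strings:

```python
latin = 'Mostly ASCII text is compact in UTF-8.'   # 1 byte/char in UTF-8
cjk = '日本語のテキスト'   # BMP CJK: 3 bytes each in UTF-8, 2 in UTF-16

# Latin-heavy text favors UTF-8 ...
assert len(latin.encode('utf-8')) < len(latin.encode('utf-16-le'))

# ... while CJK-heavy text favors UTF-16.
assert len(cjk.encode('utf-8')) > len(cjk.encode('utf-16-le'))
```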
Re: Subject Unicode
In 20140110195944.3d5f333...@panix2.panix.com, on 01/10/2014 at 02:59 PM, Don Poitras poit...@pobox.com said:

> As far as 3270 goes, I think it's just going to use the CODEPAGE and CHARSET you start ISPF with.

I think it's going to be limited to the set of EBCDIC code pages.

> As this is the first release, I'm sure there's stuff missing that will be added as time goes by.

Proper support would require enhancing the 3270 display stream.

-- Shmuel (Seymour J.) Metz, SysProg and JOAT
Re: Subject Unicode (Also email. Also TAB.)
Two short additions:

First: Regards in the 4th paragraph is a sort of typo; it should read Regarding.

Second: from the moment we stopped exchanging files by paper tape, we should have stopped putting tabs into files - if not before. My opinion ...

Kind regards
Bernd

Am 12.01.2014 18:47, schrieb Bernd Oppolzer:
> IMO, the idea to put tab characters into files is wrong from the beginning. But of course it comes from the paper tape paradigm, where a file is historically a paper tape feeding a teletype machine. With normal local typewriters, a tab is nothing other than a command to the typewriter to move the carriage to a certain position, and that's how it should be implemented in more record-oriented environments.
>
> That said, I would like it if the editors I use replaced all tabs with blanks when storing the files, so that there are never any tab characters inside the files, because when reading them you have the problem of deciding what tab positions the file is meant to have; you always have to guess, it's wrong most of the time, and the result looks awful.
>
> Regards tabs separating fields in external representations of database records: there are other possibilities. Commas and semicolons are not nice, but they work, given proper handling of the text fields (and: if the text fields really contain text and no binary data).
>
> In total: I hate tabs and try to avoid them wherever I can.
>
> Kind regards
> Bernd
Re: Subject Unicode (Also email. Also TAB.)
> you have the problem to decide what tab positions this file is meant to have, and you always have to guess, and it's wrong most of the time, and the result looks awful

Your solution would also look awful with proportional text.

- Ted MacNEIL
eamacn...@yahoo.ca
Twitter: @TedMacNEIL
Re: Subject Unicode (Also email. Also TAB.)
Am 12.01.2014 19:10, schrieb Ted MacNEIL:
>> you have the problem to decide what tab positions this file is meant to have, and you always have to guess, and it's wrong most of the time, and the result looks awful
> Your solution would also look awful with proportional text.

My focus is on source code most of the time; there I am lost with proportional fonts, anyway.
Re: Subject Unicode (Also email. Also TAB.)
On Sun, 12 Jan 2014 18:59:25 +0100, Bernd Oppolzer wrote:
> second: from the moment we stopped exchanging files by paper tape, we should have stopped putting tabs into files - if not before. My opinion ...

Why? Where else would you keep them?

> Regards tabs separating fields in external representations of database records: there are other possibilities. Commas and semicolons are not nice, but it works,

More seriously, if the data fields legitimately contain commas and/or semicolons, tab is a more useful field separator. If the data contain tabs? Well, that's a good place to apply your argument for avoiding tabs. Legibility? That goes directly back to the OP's question.

-- gil
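The field-separator point, sketched with Python's csv module: with a tab delimiter, commas and semicolons pass through unquoted, exactly as argued above (the sample row is invented):

```python
import csv
import io

row = ['Smith, John', 'notes; see below', 'plain']

buf = io.StringIO()
csv.writer(buf, delimiter='\t').writerow(row)
line = buf.getvalue()

# Commas and semicolons need no quoting when tab is the delimiter ...
assert line.rstrip('\r\n').split('\t') == row
# ... though a tab *inside* a field would force quoting, which is
# the place to apply the argument for avoiding tabs in data.
```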
Re: Subject Unicode
On Sun, 12 Jan 2014 10:45:23 -0500, Shmuel Metz (Seymour J.) wrote:
>> Notepad? What's that? Perhaps some obsolete predecessor of Wordpad?
> No, it's a superior version of wordpad. HTH.

Doesn't understand UNIX line breaks. For me that's a deal breaker.

-- gil
Re: Subject Unicode
On Linux gedit works fine; on Windows I use Notepad++, which handles Unix eols and UTF-8.

Kirk Wolf
Dovetailed Technologies
http://dovetail.com

On Sun, Jan 12, 2014 at 3:28 PM, Paul Gilmartin paulgboul...@aim.com wrote:
> On Sun, 12 Jan 2014 10:45:23 -0500, Shmuel Metz (Seymour J.) wrote:
>>> Notepad? What's that? Perhaps some obsolete predecessor of Wordpad?
>> No, it's a superior version of wordpad. HTH.
> Doesn't understand UNIX line breaks. For me that's a deal breaker.
>
> -- gil
Re: Subject Unicode (Also email. Also TAB.)
Tabs are useful for formatting input text. I use tab settings of 10, 16, 35, and 72 for HLASM source formatting; but I will not use a text editor that does not---optionally, for those who have other preferences---replace tabs with blanks during save/storage operations. Bernd and I are thus in complete agreement about this issue.

More generally, the notion of usurping the traditional function of a character to use it for another purpose where it is safe to do so is, I think, a dubious, even irresponsible one. Doing so always has untoward, sometimes tragic consequences. I am sure, for example, that the people who 'extended' C on the cheap to support strings of conceptually unlimited length, with EOS delimited by a nul, x'00' in an SBCS or x'0000' in a DBCS, thought their idea was benign. In the event they sowed badly, and we are all reaping the whirlwind.

John Gilmore, Ashland, MA 01721 - USA
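The NUL-delimiter hazard alluded to can be illustrated in Python, where bytes objects are length-counted (a sketch; the truncation below mimics what C's strlen-style handling does, and the sample data is invented):

```python
payload = b'user\x00admin'   # an embedded NUL, e.g. from hostile input

# A length-counted string keeps all ten bytes ...
assert len(payload) == 10

# ... but NUL-terminated handling silently stops at the first X'00',
# the classic source of truncation and validation-bypass bugs.
c_view = payload.split(b'\x00', 1)[0]
assert c_view == b'user'
```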
Re: Subject Unicode
BTW, Notepad++ is not only free/open source, but it also has the goal of preventing Global Warming :-) http://notepad-plus-plus.org/ .. while at the same time likes to show off: http://notepad-plus-plus.org/features/column-mode-editing.html Kirk Wolf Dovetailed Technologies http://dovetail.com On Sun, Jan 12, 2014 at 3:48 PM, Kirk Wolf k...@dovetail.com wrote: On Linux gedit works fine, on Windows I use Notepad++ which handles Unix eols and UTF-8 Kirk Wolf Dovetailed Technologies http://dovetail.com On Sun, Jan 12, 2014 at 3:28 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Sun, 12 Jan 2014 10:45:23 -0500, Shmuel Metz (Seymour J.) wrote: Notepad? What's that? Perhaps some obsolete predecessor of Wordpad? No, it's a superior version of wordpad. HTH. Doesn't understand UNIX line breaks. For me that's a deal breaker. -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
I don't generally respond to Shmuel's animadversions. This time, however, he has crossed the line of civilized behavior. His experience with Unicode appears to be limited to attentive reading of its defining documents. Its implementations are, unsurprisingly, imperfect. In particular the several UTF-8 implementations with which I am more familiar than I should wish to be are very imperfect indeed. If I argued that the comments prefixed to a routine described its putative algorithm correctly and that the routine itself could thus contain no error, Shmuel would still hopefully be quick to point out the inadequacy of my argument; but here he is guilty of the same sort of cocksure silliness. He wondered what I had been drinking. I wonder if he is not suffering from senile dementia. He is certainly exhibiting some of its characteristic symptoms. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On Sun, 12 Jan 2014 15:48:49 -0600, Kirk Wolf wrote: On Linux gedit works fine, on Windows I use Notepad++ which handles Unix eols and UTF-8 You mean I don't have to wait for Windows 14!? Thanks! Does it do UNIX eols on input *and* output? Wordpad only does the former. Thanks again, gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Some words about editors, tabs, eolchar, eofchar. The editor which I like most does the following:
- read files that have CRLF eols or LF eols
- output files with CRLF or LF, controlled by an editor setting
- output an EOF char, if desired (0x1a); most of the time, I don't want it
- allow individual tab settings (of course), either by specifying the tab positions or by specifying an increment
- translate tabs to spaces during reading, if desired
- translate tabs to spaces during writing, if desired
- translate spaces to tabs during writing, if desired (well, I wouldn't use that)
- omit trailing blanks, if desired
- or: set all records to a fixed length, controlled by editor option (filling with blanks or truncating, if necessary)
Most (simple) editors don't have such features, but if you use the editor to prepare for example data files for input and the programs you have rely on such specific details of the input files, you are happy if you have an editor at hand that allows you to do the necessary modifications. Another remark regarding tabs in (source) files: they have no meaning to the compilers etc.; the compilers treat them like spaces in the best case. So the only reason for having them in the source is for formatting purposes, and that - as we pointed out already - does not work, because the tab settings at the time of the creation of the file are not known, so you will get garbage (for the human reader) in the general case. That's why we should IMO avoid tabs in source files. Sometimes I get C programs which I have to port to z/OS, for example; one of the first steps is: remove the tabs in the sources, restore the indentation of the source and limit the source line length to 72 or 80. At least, I invest some time to do this, if I plan to take responsibility for those programs for a longer time and if I have to pass them through our normal change management and source archive systems.
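Bernd's save-time conversions are easy to state precisely. Here is a minimal Python sketch of the two most common ones; the function names are mine, not any particular editor's, and the 4-column default tab stop is an assumption.

```python
# Sketch of two common editor save-time normalizations: expanding tabs
# to the next tab stop, and converting line endings. Assumes uniform
# tab stops; editors with individual tab positions need a position list.

def expand_tabs(line: str, tabsize: int = 4) -> str:
    """Replace each tab with enough spaces to reach the next tab stop."""
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            pad = tabsize - (col % tabsize)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)

def normalize_eols(text: str, eol: str = "\n") -> str:
    """Convert CRLF or lone CR line endings to the chosen eol string."""
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", eol)
```

Trailing-blank removal and fixed-length padding are one-liners on top of these (`line.rstrip()`, `line.ljust(80)`).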
Kind regards, Bernd On 13.01.2014 00:28, Paul Gilmartin wrote: On Sun, 12 Jan 2014 15:48:49 -0600, Kirk Wolf wrote: On Linux gedit works fine, on Windows I use Notepad++ which handles Unix eols and UTF-8 You mean I don't have to wait for Windows 14!? Thanks! Does it do UNIX eols on input *and* output? Wordpad only does the former. Thanks again, gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
(Cross posting to ISPF-L and IBM-MAIN) On 2014-01-10, at 12:59, Don Poitras wrote: As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC. ... What representation does it use in the 3270 data streams? Is this well documented in the Data Streams reference? What must it do to avoid embedded 3270 command bytes? Is this compatible with Yale/7271/IND$FILE/Kermit conventions? As far as 3270 goes, I think it's just going to use the CODEPAGE and CHARSET you start ISPF with. I think it's going to be limited to the set of EBCDIC code pages. As this is the first release, I'm sure there's stuff missing that will be added as time goes by. I guess that conforms to someone's notion of support. Should I understand that one can edit UTF-8 files; one just can't see most of the characters. I guess any meaningful editing must be done with macros. (I don't yet have access to 2.1.) What happens if I turn HEX ON? Will it show the value of the Unicode code point, or of the UTF-8 sequence of bytes. Generally, neither can be represented in two hex digits. On 2014-01-10, at 16:19, Steve Comstock wrote: BTW, how can I convert majuscule-minuscule with ISPF EDIT. I know; I could write a macro ... Sheesh! Well, on a command line: c p'>' p'<' all Or, as a line command: LCC ... LCC should do it. Thanks. I hadn't known about that. So if in my UTF-8 file I have: == Polyglot == A common Russian phrase is ОЧЕНЬ ХОРОШО. The Greek might be ΠΟΛΥ ΚΑΛΑ. ... will those commands transform it to: == polyglot == a common russian phrase is очень хорошо. the greek might be πολυ καλα. ... even as Vim and LibreOffice do, and even if I can't see it? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
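For reference, here is what a fully Unicode-aware lowercase operation does with gil's sample text. This is Python's `str.lower()`, which applies the Unicode case mappings; whether ISPF EDIT's picture-string CHANGE does the same for Cyrillic and Greek on z/OS 2.1 is exactly the open question in this thread.

```python
# Unicode case mapping applied to the polyglot sample: Cyrillic and
# Greek capitals fold to their lowercase forms just as Latin ones do.
russian = "A common Russian phrase is ОЧЕНЬ ХОРОШО."
greek = "The Greek might be ΠΟΛΥ ΚΑΛΑ."
print(russian.lower())  # a common russian phrase is очень хорошо.
print(greek.lower())    # the greek might be πολυ καλα.
```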
Re: Subject Unicode
In article e488911a-b303-4d2f-8cf9-247154ab8...@aim.com you wrote: (Cross posting to ISPF-L and IBM-MAIN) On 2014-01-10, at 12:59, Don Poitras wrote: As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC. ... What representation does it use in the 3270 data streams? Is this well documented in the Data Streams reference? What must it do to avoid embedded 3270 command bytes? Is this compatible with Yale/7271/IND$FILE/Kermit conventions? As far as 3270 goes, I think it's just going to use the CODEPAGE and CHARSET you start ISPF with. I think it's going to be limited to the set of EBCDIC code pages. As this is the first release, I'm sure there's stuff missing that will be added as time goes by. I guess that conforms to someone's notion of support. Should I understand that one can edit UTF-8 files; one just can't see most of the characters. I guess any meaningful editing must be done with macros. (I don't yet have access to 2.1.) What happens if I turn HEX ON? Will it show the value of the Unicode code point, or of the UTF-8 sequence of bytes. Generally, neither can be represented in two hex digits. I don't know how these characters are going to survive email, so I'll describe what I did. Just editing all the hex from 00 to FF in EBCDIC mode, you end up with lots of glyphs that are two-byte in UTF-8. I copied one line using my emulator cut and paste and pasted the glyphs in a new member that I specified to be created using UTF-8. I then used the text split line command to put the first 5 glyphs each on a single line. The glyphs are: 1. logical not 2. pound (english money not weight) 3. Yen 4. Middle dot 5. Copyright I had to position the cursor on the correct hex byte to properly do the text-split. It's real easy to mess up the file. EDIT SASDTP.ISPF.CNTL(UTF8) - 01.02 Command === ** * Top of Data 01 ??] CACACACBCACACBCBCBCBC9CACA5CBC9222 2C2325272927262C2D2E3D282FD2437000 - 02 ?? CA 2C - 03 ?? CA 23 - 04 ?? CA 25 - 05 ??
CB 27 - 06 ?? CA 29 On 2014-01-10, at 16:19, Steve Comstock wrote: BTW, how can I convert majuscule-minuscule with ISPF EDIT. I know; I could write a macro ... Sheesh! Well, on a command line: c p'' p'' all Or, as a line command: LCC ... LCC should do it. Thanks. I hadn't known about that. So if my UTF-8 file I have: == Polyglot == A common Russian phrase is ? ??. The Greek might be . ... will those commands transform it to: == polyglot == a common russian phrase is ? ??. the greek might be . ... even as Vim and LibreOffice do, and even if I can't see it? -- gil -- Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive sas...@sas.com (919) 531-5637Cary, NC 27513 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Yeah, I didn't think that would work. :) If you're reading this as I am, all the (well most of) text below ended up as ??. In actuality, every ?? was a single width. The first line contains 16 characters with 32 hex bytes underneath. The subsequent lines are all a single character with 2 hex bytes shown. You can type in 3-byte UTF-8 codes, but they won't show anything in the text fields. I don't know how these characters are going to survive email, so I'll describe what I did. Just editing all the hex from 00 to FF in EBCDIC mode, you end up with lots of glyphs that are two-byte in UTF-8. I copied one line using my emulator cut and paste and pasted the glyphs in a new member that I specified to be created using UTF-8. I then used the text split line command to put the first 5 glyphs each on a single line. The glyphs are: 1. logical not 2. pound (english money not weight) 3. Yen 4. Middle dot 5. Copyright I had to position the cursor on the correct hex byte to properly do the text-split. It's real easy to mess up the file. EDIT SASDTP.ISPF.CNTL(UTF8) - 01.02 Command === ** * Top of Data 01 ??] CACACACBCACACBCBCBCBC9CACA5CBC9222 2C2325272927262C2D2E3D282FD2437000 - 02 ?? CA 2C - 03 ?? CA 23 - 04 ?? CA 25 - 05 ?? CB 27 - 06 ?? CA 29 -- Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive sas...@sas.com (919) 531-5637Cary, NC 27513 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
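Don's five glyphs really are two-byte UTF-8 sequences, and their encodings appear to match the vertical hex pairs in his listing (ISPF HEX ON shows the first hex digit of each byte on one row and the second below it, so bytes C2 AC display as CA over 2C). A short Python sketch to check:

```python
# The five glyphs from the text-split experiment and their UTF-8
# encodings. Each is a two-byte sequence; e.g. logical not (U+00AC)
# encodes as C2 AC, which matches the "CA / 2C" vertical hex columns.
glyphs = {"¬": "logical not", "£": "pound", "¥": "yen",
          "·": "middle dot", "©": "copyright"}
for ch, name in glyphs.items():
    utf8 = ch.encode("utf-8")
    print(f"{name:12} U+{ord(ch):04X} -> {utf8.hex(' ').upper()}")
```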
Re: Subject Unicode
Other than with a lot of inferential cleverness, there is no way to look at an ASCII-like file and tell what the code page is. The same applies to data encoded in EBCDIC. In fact, files are nothing but a series of bytes. You always need to know what those bytes represent in order to be able to work on them in a meaningful way. Especially in the distributed world, some conventions have been established that help programs in guessing what the file content might be. The first couple of bytes contain a certain byte sequence to identify the type of the file. But still, there is no guarantee the rest of the file matches that indication. Unfortunately, no such convention exists for pure text data. Neither a convention to indicate "this is text" nor one to tell the encoding / code page used. -- Peter Hunkeler -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
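The magic-number convention Peter describes can be sketched in a few lines. The signature table below lists a handful of well-known formats; real tools (file(1), libmagic) carry far larger tables, and as Peter says, a matching prefix is no guarantee about the rest of the file.

```python
# Minimal sketch of file-type sniffing by leading "magic" bytes.
# Text files have no such signature, which is Peter's point.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"GIF8", "GIF image"),
    (b"%PDF-", "PDF document"),
]

def sniff(data: bytes) -> str:
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown (could be anything, including text)"
```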
Re: Subject Unicode
You could use the BOM UTF characters to determine whether a file is UTF or not, and what form of UTF (UTF-8, UTF-16, UTF-32, big-endian or little-endian) is being used. The BOM characters are the UTF defined characters usually inserted transparently at the beginning of a UTF file. Granted this is not a perfect answer, but it may help for want of any other way to determine if a file is UTF or not. However, BOM characters are not always present; some platforms always have them (Microsoft) and some platforms eschew them. Windows Notepad is particularly tricky because it adds them without you realizing it. So whether you look at a file with Notepad (or other simple editor) or don't can both affect your results and cause you to question your sanity because you didn't realize this. BOM characters can be very useful. For example an XML header defines character encoding, but BOM characters can be used to determine the character encoding of the XML header itself. For UTF-8 the BOM character can be used to determine if a file is UTF encoded or not. But, for UTF-16 and UTF-32, it also allows you to determine the endianness of the UTF code units. Harry Date: Fri, 10 Jan 2014 08:01:42 + From: peter.hunke...@credit-suisse.com Subject: Re: Subject Unicode To: IBM-MAIN@LISTSERV.UA.EDU Other than with a lot of inferential cleverness, there is no way to look at an ASCII-like file and tell what the code page is. The same applies to data encoded in EBCDIC. In fact, files are nothing but a series of bytes. You always need to know what those bytes represent in order to be able to work on them in a meaningful way. Especially in the distributed world, some conventions have been established that help programs in guessing what the file content might be. The first couple of bytes contain a certain byte sequence to identify the type of the file. But still, there is no guarantee the rest of the file matches that indication. Unfortunately, no such convention exists for pure text data.
Neither a convention to indicate this is text nor to tell the encoding / code page used. -- Peter Hunkeler -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
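Harry's BOM sniffing is easy to make concrete. One subtlety worth showing: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so the longer patterns must be tested first. A sketch:

```python
# BOM detection sketch. Absence of a BOM proves nothing (many
# platforms omit it); presence is a strong, but not certain, hint.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32 BE"),
    (b"\xff\xfe\x00\x00", "UTF-32 LE"),  # must precede UTF-16 LE
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UTF-16 BE"),
    (b"\xff\xfe",         "UTF-16 LE"),
]

def detect_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```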
Re: Subject Unicode
John, if you are saying that there are some Unicode characters that cannot be represented in UTF-8 then that is incorrect. *Any* Unicode character -- pretty much any character in the world -- may be represented in UTF-8. For external representations of Unicode the battle is pretty much over and UTF-8 won. Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of John Gilmore Sent: Friday, January 10, 2014 6:51 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode I have refrained from saying anything about this topic because I judged that anything I said would be predictable. I am a well-known offender, a flagrant Unicode, i.e., minimally UTF-16, advocate. Now, however, Charles Mills has pushed me into posting something. He writes begin extract That is called UTF-16. Pretty good but still not very efficient. /end extract As usual, it depends. If one's problems are always with a single pair of natural languages, one of which is English (ENG or ENU), which makes little use of orthographically marked letters, a satisfactory UTF-8 'solution' may be, indeed usually is, possible. Something can, that is, be done in a UTF-8 framework with such language pairs as o English and French. o English and German, or even o English and Polish. As soon, however, as you need to support o three or more different roman-alphabet natural languages, or o a roman-alphabet language and a non-alphabetic Asian language you need UTF-16. To put the matter more brutally, any new system being built today and in particular any new system that is likely to interact, at whatever remove, with web-based systems should use UTF-16. The notion that the only efficient representation for character data is an SBCS one is retrograde at best. Continuing with it will make trouble for those who do so; worse, it will ensure that the systems they build are short-lived. The ASCII vs EBCDIC dispute is no longer of much interest.
They are both obsolescent, usable safely only in what the international lawyers call municipal contexts. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Fair enough. I was answering a question about French Unicode at five o'clock. I certainly don't mean to get hung up on efficiency and yes, for certain character distributions, UTF-16 yields a shorter file or message length than UTF-8. Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Timothy Sipples Sent: Thursday, January 09, 2014 11:31 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode Charles Mills writes: You could use 16 bits for every character, with some sort of cleverness that yielded two 16-bit words when you had a code point bigger than 65535 (actually somewhat less due to how the cleverness works). That is called UTF-16. Pretty good but still not very efficient. In Japan and China, to pick a couple examples, UTF-16 is rather efficient. There are also far worse inefficiencies than using 16 bits to store each Latin character. In short, I wouldn't get *too* hung up on this point, especially as the complete lifecycle costs of storage continue to fall. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
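The "cleverness" Charles mentioned is UTF-16's surrogate-pair mechanism, and his "actually somewhat less" is because the pair ranges D800-DFFF are carved out of the 16-bit space and cannot encode characters themselves. A sketch of the arithmetic:

```python
# UTF-16 surrogate pairs: a code point above U+FFFF is reduced by
# 0x10000, leaving 20 bits, which are split into two 10-bit halves
# carried in the reserved high (D800-DBFF) and low (DC00-DFFF) ranges.
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    assert cp > 0xFFFF, "BMP code points are encoded as a single unit"
    v = cp - 0x10000            # 20 bits remain
    high = 0xD800 + (v >> 10)   # top 10 bits
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits
    return high, low

# U+1F600 (an emoji) becomes the pair D83D DE00:
print([hex(u) for u in to_surrogate_pair(0x1F600)])
```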
Re: Subject Unicode
Gil: Co:Z SFTP and DatasetPipes both support any single-byte encoding as well as UTF-8 when converting to/from datasets. You can use either iconv or unicode system services, including custom tables and techniques. Scott: What is a foreign language Unicode page? Can you give a specific example? Kirk Wolf Dovetailed Technologies http://dovetail.com On Thu, Jan 9, 2014 at 6:47 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Thu, 9 Jan 2014 16:35:55 -0800, Scott Ford wrote: All: I have a fundamental question on Unicode, or more of how it works. I am confused about the following scenario: PC (data using a foreign language Unicode page, like French) going to z/OS and being kept intact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired? or how does it work? I believe, yes. What is the desired? iconv may be your friend here, either as a shell command or as a library subroutine, after transferring the file in BINARY. Will Co:Z let the user specify the target code page when transferring a file? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
historical reference 1960-1979 http://www.bobbemer.com/REGISTRY.HTM ibm major driver behind all this http://www.bobbemer.com/ZACHERLY.HTM however, Learson had problem and made decision to temporarily go with EBCDIC w/o realizing what he had done (The Biggest Computer Goof Ever) ... and the company got stuck with it http://www.bobbemer.com/P-BIT.HTM lots of other history http://www.bobbemer.com/HISTORY.HTM -- virtualization experience starting Jan1968, online at home since Mar1970 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Charles I do not think you read my post at all carefully. I made it clear that for specific language pairs UTF-8 is adequate if often clumsy. For multiple-language environments it is equally clear that it is inadequate. It is of course true that any grapheme, even say some company's logo or an astrological house, can be represented in UTF-8. The problem is not one of representability but of subset choice. The decision to include one may preclude the inclusion of another. Some subsets of at most 256 characters are adequate to some particular tasks and others are adequate to other particular tasks. None is adequate to all such tasks. Moreover, in my now considerable controversial experience I have noted that people who assert that 1) the real meaning of some word is what they want it to be or 2) that a battle is pretty much over and their side has won are arguing hopefully, trying to convince others, not recording the judgment of history. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
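The subset-choice problem John describes is real for 256-character single-byte code pages; whether it applies to UTF-8 is exactly what is in dispute. A sketch with Python's codecs shows the code-page version of it (the sample words are mine):

```python
# Subset choice in single-byte code pages: Latin-1 holds the German
# sample, Windows-1253 holds the Greek one, but no 256-character code
# page holds both. UTF-8 holds both in one encoding.
def fits(text: str, codepage: str) -> bool:
    """True if every character of text is representable in codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

german, greek = "Zürich", "Αθήνα"
print(fits(german, "latin-1"), fits(greek, "latin-1"))  # True False
print(fits(german, "cp1253"), fits(greek, "cp1253"))    # False True
print(fits(german + greek, "utf-8"))                    # True
```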
Re: Subject Unicode
On Fri, 10 Jan 2014 11:02:57 -0500, John Gilmore wrote: Charles I do not think you read my post at all carefully. I made it clear that for specific language pairs UTF-8 is adequate if often clumsy. For multiple-language environments it is equally clear that it is inadequate. It is of course true that any grapheme, even say some company's logo or an astrological house, can be represented in UTF-8. The problem is not one of representability but of subset choice. The decision to include one may preclude the inclusion of another. Some subsets of at most 256 characters are adequate to some particular tasks and others are adequate to other particular tasks. None is adequate to all such tasks. Do you accept that: o UTF-8 is a variable length encoding scheme? o UTF-8 has representations for all the million plus Unicode characters? o The UTF-8 representation of any character is invariant with respect to any choice of specific language [pairs]? Given these premises (which I accept) it does not occur that '[t]he decision to include one [grapheme] may preclude the inclusion of another.' There is no problem [...] of subset choice. -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
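Gil's three premises can be checked mechanically: UTF-8 is variable length (1-4 bytes per character), it covers every Unicode scalar value, and a character's encoding never depends on what other characters surround it. A sketch:

```python
# Checking gil's premises. The sample characters are one from each
# UTF-8 length class: ASCII, Latin-1 supplement, BMP, supplementary.
samples = ["A", "é", "€", "😀"]
print([len(c.encode("utf-8")) for c in samples])  # [1, 2, 3, 4]

# Context independence: 'é' encodes to the same bytes alone or
# embedded in any text, whatever the language mix.
assert "é".encode("utf-8") == "café".encode("utf-8")[3:]
```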
Re: Subject Unicode
Gil is 100% correct. And the assertion that the battle is over and UTF-8 has won is not my opinion. I don't have a dog in this fight. The world can go to 5-bit Baudot for all I care. It's simply a fact: http://w3techs.com/technologies/overview/character_encoding/all . Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Paul Gilmartin Sent: Friday, January 10, 2014 8:32 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode On Fri, 10 Jan 2014 11:02:57 -0500, John Gilmore wrote: Charles I do not think you read my post at all carefully. I made it clear that for specific language pairs UTF-8 is adequate if often clumsy. For multiple-language environments it is equally clear that it is inadequate. It is of course true that any grapheme, even say some company's logo or an astrological house, can be represented in UTF-8. The problem is not one of representability but of subset choice. The decision to include one may preclude the inclusion of another. Some subsets of at most 256 characters are adequate to some particular tasks and others are adequate to other particular tasks. None is adequate to all such tasks. Do you accept that: o UTF-8 is a variable length encoding scheme? o UTF-8 has representations for all the million plus Unicode characters? o The UTF-8 representation of any character is invariant with respect to any choice of specific language [pairs]? Given these premises (which I accept) it does not occur that '[t]he decision to include one [grapheme] may preclude the inclusion of another. There is no problem [...] of subset choice. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Cute. Notepad still exists in current Windows, btw. On Fri, Jan 10, 2014 at 9:41 AM, Paul Gilmartin paulgboul...@aim.com wrote: On Fri, 10 Jan 2014 09:36:32 -0500, Harry Wahl wrote: ... Windows Notepad is particularly tricky because it adds them without you realizing it. So whether you look at a file with Notepad (or other simple editor) or don't can both affect your results and cause you to question your sanity because you didn't realize this. Notepad? What's that? Perhaps some obsolete predecessor of Wordpad? -- gil -- zMan -- I've got a mainframe and I'm not afraid to use it -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On 1/10/2014 10:28 AM, zMan wrote: Cute. Notepad still exists in current Windows, btw. And it handles utf-8 fine. -Steve On Fri, Jan 10, 2014 at 9:41 AM, Paul Gilmartin paulgboul...@aim.com wrote: On Fri, 10 Jan 2014 09:36:32 -0500, Harry Wahl wrote: ... Windows Notepad is particularly tricky because it adds them without you realizing it. So whether you look at a file with Notepad (or other simple editor) or don't can both affect your results and cause you to question your sanity because you didn't realize this. Notepad? What's that? Perhaps some obsolete predecessor of Wordpad? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Paul, No, I do not accept the premises you set out. I will try, when I have more time, to make clear why with examples. Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist. Moreover, even when they are available, my experience with them has been bad. In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
You are mistaken. The rules for encoding a longer UTF-8 character are well-defined. http://en.wikipedia.org/wiki/UTF-8#Description Yes, it is a fact that for files with mostly Asian and similar characters UTF-8 is longer than UTF-16. Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of John Gilmore Sent: Friday, January 10, 2014 10:28 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode Paul, No, I do not accept the premises you set out. I will try, when I have more time, to make clear why with examples. Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist. Moreover, even when they are available, my experience with them has been bad. In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
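The rules Charles links to fit in a dozen lines. Here is a sketch of the encoder; for brevity it omits the validity checks a real encoder must add (rejecting surrogates D800-DFFF and values above U+10FFFF):

```python
# The UTF-8 encoding rules written out: each code point maps to exactly
# one 1-4 byte sequence, determined only by the code point's value.
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:                     # 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                    # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                  # 16 bits: 1110xxxx 10xxxxxx x2
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,    # 21 bits: 11110xxx 10xxxxxx x3
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

# Agrees with a library implementation for a sample from each plane:
for cp in (0x41, 0x3A9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```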
Re: Subject Unicode
On Jan 10, 2014, at 12:28 PM, John Gilmore jwgli...@gmail.com wrote: Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist. Sure they do. From http://www.unicode.org/faq/utf_bom.html#UTF8: "UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5, Encoding Forms, and Section 3.9, Unicode Encoding Forms, in the Unicode Standard." Also, at http://www.unicode.org/resources/utf8.html: ANSI C implementation of UTF-8 (http://www.bsdua.org/files/unicode.tar.gz) Converts UTF-8 into UCS4 and vice versa. Source code is BSD licensed. Moreover, even when they are available, my experience with them has been bad. In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version. As far as I've been able to see, the Unicode consortium views UTF-8 and UTF-16 as equally viable. Which is preferable depends entirely on the character of the texts you're processing. (Well, with UTF-16 you have to worry about endianness but with UTF-8 you don't.) If your text is mostly Latin and related characters, UTF-8 will probably be shorter. If it includes a significant amount of CJK (Chinese/Japanese/Korean) characters, as you apparently had here, UTF-16 will probably be shorter. -- Curtis Pew (c@its.utexas.edu) ITS Systems Core The University of Texas at Austin -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
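Curtis's point is easy to measure. A sketch comparing the two forms (utf-16-be is used so no BOM is counted; the sample strings are mine):

```python
# Which form is shorter depends on the script: Latin text is one byte
# per character in UTF-8 and two in UTF-16; most CJK characters are
# three bytes in UTF-8 and two in UTF-16.
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 length, UTF-16 length) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

print(sizes("The quick brown fox"))  # (19, 38) -- UTF-8 wins
print(sizes("こんにちは世界"))         # (21, 14) -- UTF-16 wins
```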
Re: Subject Unicode
In article 8790842028980392.wa.paulgboulderaim@listserv.ua.edu you wrote: On Thu, 9 Jan 2014 22:44:19 -0500, Don Poitras wrote: As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC. ... Does this support both UNIX and legacy files? If the latter, does it require RECFM=V? Using a variable-length character encoding in fixed length records seems pretty inconsistent. Yes. No. The same issue was true of DBCS which they've supported for years. I had a test case when I was converting CONDOR to use DBCS that caused a PROG error under ISPF, but I made sure CONDOR did something reasonable with it. How does it report invalid UTF-8 byte sequences? It doesn't. Does it still automatically switch to CAPS ON? What does it recognize as majuscule or minuscule with CAPS ON in Cyrillic characters, e.g.? Actually, it looks as though they have a bug with this. If I save a member with all caps, the next time I come in it says CAPS on was turned off because I have lower case characters. I don't have an emulator that will display or enter Cyrillic characters. I don't know which emulators will show all the other bazillion glyphs though... Indeed. Emulators? What about hardware for the emulators to emulate? Or does it require WSA? I don't know which hardware will display all the glyphs either. The only foreign 3270 I ever used was Korean. It had some funny keys and you had to type in several for each DBCS character. What representation does it use in the 3270 data streams? Is this well documented in the Data Streams reference? What must it do to avoid embedded 3270 command bytes? Is this compatible with Yale/7271/IND$FILE/Kermit conventions? As far as 3270 goes, I think it's just going to use the CODEPAGE and CHARSET you start ISPF with. I think it's going to be limited to the set of EBCDIC code pages. As this is the first release, I'm sure there's stuff missing that will be added as time goes by.
-- gil -- Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive sas...@sas.com (919) 531-5637 Cary, NC 27513 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
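For comparison with ISPF's "It doesn't": not every byte sequence is legal UTF-8, and a strict decoder must reject overlong forms, stray continuation bytes, and encoded surrogates. A sketch of what such reporting looks like where a strict decoder is available (z/OS facilities may well behave differently):

```python
# Validating UTF-8 with a strict decoder. The rejected examples are
# the classic ill-formed sequences from the UTF-8 specification.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"\xc2\xac"))      # True  (logical not)
print(is_valid_utf8(b"\xc0\xaf"))      # False (overlong encoding)
print(is_valid_utf8(b"\xa0"))          # False (lone continuation byte)
print(is_valid_utf8(b"\xed\xa0\x80"))  # False (encoded surrogate D800)
```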
Re: Subject Unicode
On 10 January 2014 13:28, John Gilmore jwgli...@gmail.com wrote: Briefly, effective rules for encoding any 'character' recognized as a Unicode one as a 'longer' UTF-8 one do not in general exist. I am most puzzled to read this. UTF-8 is what Unicode calls a transform format, and the conversion from other encodings of Unicode characters is strictly (and simply) algorithmic, and by extension, unambiguous. (In the early Unicode discussions in the 1990s, some people whose native language was not English objected to the ambiguity and even intranslatability of the English phrase transform format, but despite that, the algorithmicity remains and is definitive.) Moreover, even when they are available, my experience with them has been bad. In dealing recently with a document containing mixed English, German, Korean and Japanese text I found that the UTF-8 version was 23% longer than the UTF-16 version. That I don't doubt at all. Whether UTF-8 is a good format for storage, transmission, or manipulation of Unicode characters surely varies by context. Tony H. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
I am familiar with Unicode. Wikipedia assertions of this or that about it do not persuade me of much of anything. Moreover, as a review of the archives will show, I am an advocate of its use. I have, however, found all of the UTF-8 implementations I have used both unsatisfactory and unreliable in the literal sense that conversions into UTF-8 from UTF-16 using them do not always yield the same results. If I have one, I suppose that English is my mother tongue; but, unlike some of you, my preoccupations are not exclusively or even predominantly anglophone. I am a polyglot. There is no effective appeal from my determination that a passage from Leopardi, say, is mangled when it is converted/moved from UTF-16 to UTF-8. I have of course reported these anomalies to the appropriate Unicode bodies. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
John, PMFJI here, but is it your position that because the *implementations* of Unicode character conversion routines (have been / are) flawed, that the *concept* of character conversions between UTF-16 and UTF-8 is useless? From my admittedly limited knowledge and research about the UTF-8 and UTF-16 character formats, ISTM that provably correct character-by-character conversion algorithms are and ought to be absolutely achievable. Not *language* conversion mind you, only *character* conversion. Language conversion is an entirely different kettle of fish. I won't argue that such character conversion algorithms currently exist, of course. I have not done sufficient research or experimentation to make that statement. Peter -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of John Gilmore Sent: Friday, January 10, 2014 4:10 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode I am familiar with Unicode. Wikipedia assertions of this or that about it do not persuade me of much of anything. Moreover, as a review of the archives will show, I am an advocate of its use. I have, however, found all of the UTF-8 implementations I have used both unsatisfactory and unreliable in the literal sense that conversions into UTF-8 from UTF-16 using them do not always yield the same results. If I have one, I suppose that English is my mother tongue; but, unlike some of you, my preoccupations are not exclusively or even predominantly anglophone. I am a polyglot. There is no effective appeal from my determination that a passage from Leopardi, say, is mangled when it is converted/moved from UTF-16 to UTF-8. I have of course reported these anomalies to the appropriate Unicode bodies. 
John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On Jan 10, 2014, at 3:10 PM, John Gilmore jwgli...@gmail.com wrote: I have, however, found all of the UTF-8 implementations I have used both unsatisfactory and unreliable in the literal sense that conversions into UTF-8 from UTF-16 using them do not always yield the same results. Is the issue related to surrogate pairs? This is in the FAQ I linked to in my previous email: Q: How do I convert a UTF-16 surrogate pair such as D800 DC00 to UTF-8? As one four-byte sequence or as two separate three-byte sequences? A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four-byte sequence. However, there is a widespread practice of generating pairs of three-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatibility Encoding Scheme for UTF-16: 8-bit (CESU-8) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it were UTF-8, due to the similarity of the formats. [AF] If I have one, I suppose that English is my mother tongue; but, unlike some of you, my preoccupations are not exclusively or even predominantly anglophone. I am a polyglot. There is no effective appeal from my determination that a passage from Leopardi, say, is mangled when it is converted/moved from UTF-16 to UTF-8. Then whatever converted it for you has a bug, because there is an isomorphic relationship between UTF-16 and UTF-8. I have of course reported these anomalies to the appropriate Unicode bodies. Perhaps you should report it to whoever created your conversion software. 
-- Curtis Pew (c@its.utexas.edu) ITS Systems Core The University of Texas at Austin -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
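The UTF-8 versus CESU-8 distinction Curtis quotes can be demonstrated directly. A small Python sketch showing both encodings of the surrogate pair D800 DC00 (the "surrogatepass" error handler is used only to reproduce the non-conformant CESU-8 form):

```python
# U+10000 is the first supplementary character; in UTF-16 it is the
# surrogate pair D800 DC00.
ch = "\U00010000"

# Conformant UTF-8: a single four-byte sequence.
assert ch.encode("utf-8") == b"\xf0\x90\x80\x80"

# CESU-8 style: each UTF-16 surrogate encoded separately as three bytes,
# six bytes in total. Python's "surrogatepass" handler reproduces it.
hi, lo = "\ud800", "\udc00"
cesu8 = hi.encode("utf-8", "surrogatepass") + lo.encode("utf-8", "surrogatepass")
assert cesu8 == b"\xed\xa0\x80\xed\xb0\x80" and len(cesu8) == 6
```

A converter emitting the six-byte form would explain results that differ between implementations while still being decodable by lenient readers.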
Re: Subject Unicode
I have not been able to identify a defect in the scheme specified for UTF-16 to UTF-8. I have pointed to implementations that are sometimes unsuccessful, and their failures have some common characteristics. For now, I avoid UTF-8 when I can. I expect that it will be problem-free at some not at all remote time in the future. I certainly was not prescient enough to think so ten years ago, but I now somewhat regret the availability of UTF-8. Its unsuitability for use with non-alphabetic text or with mixed 'alphabetic' and non-alphabetic text, like written Japanese, has produced a sharp difference in Eastern and Western Unicode usage patterns that is at best unfortunate. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On Fri, 10 Jan 2014 10:44:10 -0700, Steve Comstock wrote: On 1/10/2014 10:28 AM, zMan wrote: Cute. Notepad still exists in current Windows, btw. And it handles utf-8 fine. SIGH Notepad handles UTF-8 fine (on a scientific sample of 1). But it's utterly ignorant of UNIX line separators. Wordpad handles UNIX line separators on input, but not on output. I guess half is better than none. But it's utterly ignorant of UTF-8. /SIGH Vim on both Ubuntu Linux and OS X seems to be UTF-8 clever, even brilliant. In a document containing both Latin and Cyrillic text, the flip case command ('~') converts majuscule-minuscule for both, both ways. BTW, how can I convert majuscule-minuscule with ISPF EDIT? I know; I could write a macro ... Sheesh! -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On 1/10/2014 3:52 PM, Paul Gilmartin wrote: On Fri, 10 Jan 2014 10:44:10 -0700, Steve Comstock wrote: On 1/10/2014 10:28 AM, zMan wrote: Cute. Notepad still exists in current Windows, btw. And it handles utf-8 fine. SIGH Notepad handles UTF-8 fine (on a scientific sample of 1). But it's utterly ignorant of UNIX line separators. Wordpad handles UNIX line separators on input, but not on output. I guess half is better than none. But it's utterly ignorant of UTF-8. /SIGH Vim on both Ubuntu Linux and OS X seems to be UTF-8 clever, even brilliant. In a document containing both Latin and Cyrillic text, the flip case command ('~') converts majuscule-minuscule for both, both ways. BTW, how can I convert majuscule-minuscule with ISPF EDIT? I know; I could write a macro ... Sheesh! Well, on a command line: c p'>' p'<' all Or, as a line command: LCC . . . LCC should do it. -Steve -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Coming in Windows 14: WordNote, which will handle UTF-8 *and* UNIX line separators!!! On Fri, Jan 10, 2014 at 5:52 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Fri, 10 Jan 2014 10:44:10 -0700, Steve Comstock wrote: On 1/10/2014 10:28 AM, zMan wrote: Cute. Notepad still exists in current Windows, btw. And it handles utf-8 fine. SIGH Notepad handles UTF-8 fine (on a scientific sample of 1). But it's utterly ignorant of UNIX line separators. Wordpad handles UNIX line separators on input, but not on output. I guess half is better than none. But it's utterly ignorant of UTF-8. /SIGH Vim on both Ubuntu Linux and OS X seems to be UTF-8 clever, even brilliant. In a document containing both Latin and Cyrillic text, the flip case command ('~') converts majuscule-minuscule for both, both ways. BTW, how can I convert majuscule-minuscule with ISPF EDIT? I know; I could write a macro ... Sheesh! -- gil -- zMan -- I've got a mainframe and I'm not afraid to use it -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
On Thu, 9 Jan 2014 16:35:55 -0800, Scott Ford wrote: All: I have a fundamental question on Unicode, or more of how it works. I am confused about the following scenario: PC ( data using a foreign language Unicode page, like French ) going to z/OS and being kept intact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? I believe, yes. What is the desired ? iconv may be your friend here, either as a shell command or as a library subroutine, after transferring the file in BINARY. Will Co:Z let the user specify the target code page when transferring a file? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Gil, We send a data message from a pc, we encrypt it with AES128, the message is received at the host (z/OS), decrypted, then converted from ASCII to EBCDIC. So I am trying to figure out how to determine what codepage the pc uses and have z/OS convert it to the proper EBCDIC codepage from ASCII. Does that help ? Scott ford www.identityforge.com from my IPAD On Jan 9, 2014, at 7:47 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Thu, 9 Jan 2014 16:35:55 -0800, Scott Ford wrote: All: I have a fundamental question on Unicode, or more of how it works. I am confused about the following scenario: PC ( data using a foreign language Unicode page, like French ) going to z/OS and being kept intact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? I believe, yes. What is the desired ? iconv may be your friend here, either as a shell command or as a library subroutine, after transferring the file in BINARY. Will Co:Z let the user specify the target code page when transferring a file? -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
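Once the PC's code page is known, the decode-then-re-encode step Scott describes is mechanical. A sketch using Python's built-in codecs, assuming (purely for illustration) that the PC sends Windows-1252 and the host wants EBCDIC code page 037:

```python
# Hypothetical message bytes from a Windows PC; code page 1252 is an
# assumption for this example, not something detectable from the bytes.
pc_bytes = "Société Générale".encode("cp1252")

# Decode with the sender's code page, then re-encode to EBCDIC 037.
host_bytes = pc_bytes.decode("cp1252").encode("cp037")

# Every character here exists in both code pages, so nothing is lost.
assert host_bytes.decode("cp037") == "Société Générale"
```

The hard part, as the following replies point out, is knowing `cp1252` in the first place; the conversion itself is the easy half.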
Re: Subject Unicode
There is no such thing as French Unicode. That is the uni part and the beauty of Unicode. There are several flavors of Unicode, but they relate to how the code points are stored in a file or transmitted, not to the character set. All of Unicode is something like a million possible characters (someone will no doubt correct me with the exact number in use). Plain old ABC, French letters like ô, symbols like €, it's all there in one big Unicode. Every letter is always the same, whether you are in America or in France.

Now, how do you represent that in a file or whatever? Well, you could use 32 bits for every character. Not very efficient, but certainly straightforward. That is called UTF-32. It's not very common.

You could use 16 bits for every character, with some sort of cleverness that yielded two 16-bit words when you had a code point bigger than 65535 (actually somewhat less due to how the cleverness works). That is called UTF-16. Pretty good but still not very efficient.

You could use 8 bits for most characters, with cleverness that expanded that out to two or three bytes for more obscure characters. Pretty efficient, and you could make the first part of the character set the same as ASCII, which would make it intuitive for PC folks who know that A is X'41'. That is called UTF-8, and it's pretty good and pretty popular as a result. Most Web pages are in UTF-8 and I believe this e-mail came to you in UTF-8.

Okay? Now, define keep it intact. Do you mean bit for bit intact, or do you mean so that when I open it up in ISPF, what looked like an A on the PC now looks like an A in ISPF? If the former, you want a binary transfer, end of story. If the latter, you don't really want to keep it intact, you want to translate Unicode -- and you will need to know which flavor of Unicode encoding (not what country) -- to EBCDIC, which is what ISPF and most COBOL programs expect. Comprende? 
Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Scott Ford Sent: Thursday, January 09, 2014 4:36 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Subject Unicode All: I have a fundamental question on Unicode, or more of how it works . I am confused about the following scenario.. PC ( data using a foreign language Unicode page, like French ) going to z/OS and being keep in tact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
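Charles's three flavors are easy to compare empirically. A rough sketch of the storage cost of each encoding for a few sample strings (byte counts; the sample texts are just illustrations):

```python
# Byte cost of UTF-32, UTF-16, and UTF-8 for the same text.
samples = {
    "English":  "Hello, world",
    "French":   "Hôtel à côté",
    "Japanese": "日本語のテキスト",
}
for name, s in samples.items():
    print(f"{name:8}"
          f"  UTF-32: {len(s.encode('utf-32-le')):3}"   # 4 bytes per code point
          f"  UTF-16: {len(s.encode('utf-16-le')):3}"   # 2 (4 for supplementary)
          f"  UTF-8:  {len(s.encode('utf-8')):3}")      # 1-4; ASCII stays 1
```

For Latin-heavy text UTF-8 wins; for CJK text UTF-16 is tighter, which is the usage split later posts in this thread argue about.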
Re: Subject Unicode
Scott - The PC is going to have to provide the codepage of the message data someplace in the communication protocol. Either as a separate field, separate message or as a prefix/suffix to the message data. It will be pretty dicey to attempt to guess the codepage based on the message data. One other possibility would be to provide a configuration file to the z/OS side which says what codepage the PC is using. Then the PC would need to actually use the agreed upon codepage. Sam On Thu, Jan 9, 2014 at 5:39 PM, Scott Ford scott_j_f...@yahoo.com wrote: Gil, We send a data message from a pc, we encrypt it with AES128 , the message is received at the host (z/OS) decrypted then converted from ascii to ebcdic..so I am trying to figure out how to Determine what codepage the pc uses and have z/OS convert it to the proper EBCDIC codepage from ASCII. Does that help ? Scott ford www.identityforge.com from my IPAD On Jan 9, 2014, at 7:47 PM, Paul Gilmartin paulgboul...@aim.com wrote: On Thu, 9 Jan 2014 16:35:55 -0800, Scott Ford wrote: All: I have a fundamental question on Unicode, or more of how it works . I am confused about the following scenario.. PC ( data using a foreign language Unicode page, like French ) going to z/OS and being keep in tact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? I believe, yes. What is the desired ? iconv may be your friend here, either as a shell command or as a library subroutine, after transferring the file in BINARY. Will Co:Z let the user specify the target code page when transferring a file? 
-- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Wait. Unicode, or some ASCII variant like, say, a French 7-bit PC code page? Other than with a lot of inferential cleverness, there is no way to look at an ASCII-like file and tell what the code page is. Think about it. The whole problem is that you use X'9B' on your PC to mean ¢ and a Frenchman uses it to mean something else. Your program sees an X'9B' in the file. What does it mean? Someone is going to have to tell you the original code page. Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Scott Ford Sent: Thursday, January 09, 2014 5:39 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Subject Unicode Gil, We send a data message from a pc, we encrypt it with AES128 , the message is received at the host (z/OS) decrypted then converted from ascii to ebcdic..so I am trying to figure out how to Determine what codepage the pc uses and have z/OS convert it to the proper EBCDIC codepage from ASCII. Does that help ? -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
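Charles's X'9B' example can be shown concretely: the same byte decodes to different characters under different single-byte code pages, and nothing in the byte itself says which reading is right. A short Python sketch:

```python
# One byte, two meanings: code page ambiguity in single-byte encodings.
b = b"\x9b"
print(b.decode("cp437"))   # U.S. PC code page 437: the cent sign, ¢
print(b.decode("cp850"))   # Western European DOS code page 850: ø
```

This is why the sender's code page must travel with the data (or be agreed out of band); it cannot be recovered from the bytes alone.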
Re: Subject Unicode
On 9 January 2014 20:39, Scott Ford scott_j_f...@yahoo.com wrote: We send a data message from a pc, we encrypt it with AES128 , the message is received at the host (z/OS) decrypted then converted from ascii to ebcdic..so I am trying to figure out how to Determine what codepage the pc uses and have z/OS convert it to the proper EBCDIC codepage from ASCII. Does that help ? I'm not sure how your question relates to UNICODE. If the data on the PC (Windows, I assume) is in some encoding of UNICODE, then code pages don't really come into play. Any version of Windows can (in theory) use any UNICODE character, regardless of the country or language of installation. So there will be no difference between the way a US English Windows box encodes, say the dollar sign character and the way a French French one does. The UNICODE code point for dollar sign is U+0024, and that's that. But you also mention ASCII, which (loosely) is an 8-bit encoding. (Usually ASCII these days really means some single-byte code page such as ISO 8859-n or one of the Windows ones such as 1252.) There is no general way to convert UNICODE into EBCDIC, because no IBM EBCDIC code page encodes all UNICODE characters. And if you are talking about single-byte EBCDIC code pages such as 037 or 1047, IBM's generally encode 192 characters, vs tens of thousands in UNICODE. If your PC data is in ASCII, i.e. single byte encoding, then you have to both determine the code page in use on the PC, and the one in use on your z/OS, and then use the appropriate mapping. Such a mapping may not exist. For example, if your PC is using a Polish code page, and your z/OS a Western European one such as 1047, there are characters in each that just aren't in the other. Something will break - generally someone's name will be misspelled, or worse. Maybe you can give a short example of what data you have at each end, and what you want to happen to it. Tony H. 
-- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
As of z/OS 2.1, ISPF supports UTF-8, so a binary transfer will still show an A if it was an A on the PC. I don't know which emulators will show all the other bazillion glyphs though... In article 022c01cf0da5$a7b25180$f716f480$@mcn.org you wrote: There is no such thing as French Unicode. That is the uni part and the beauty of Unicode. There are several flavors of Unicode, but they relate to how the code points are stored in a file or transmitted, not to the character set. All of Unicode is something like a million possible characters (someone will no doubt correct me with the exact number in use). Plain old ABC, French letters like ô, symbols like €, it's all there in one big Unicode. Every letter is always the same, whether you are in America or in France. Now, how do you represent that in a file or whatever? Well, you could use 32 bits for every character. Not very efficient, but certainly straightforward. That is called UTF-32. It's not very common. You could use 16 bits for every character, with some sort of cleverness that yielded two 16-bit words when you had a code point bigger than 65535 (actually somewhat less due to how the cleverness works). That is called UTF-16. Pretty good but still not very efficient. You could use 8 bits for most characters, with cleverness that expanded that out to two or three bytes for more obscure characters. Pretty efficient, and you could make the first part of the character set the same as ASCII, which would make it intuitive for PC folks who know that A is X'41'. That is called UTF-8, and it's pretty good and pretty popular as a result. Most Web pages are in UTF-8 and I believe this e-mail came to you in UTF-8. Okay? Now, define keep it intact. Do you mean bit for bit intact, or do you mean so that when I open it up in ISPF, what looked like an A on the PC now looks like an A in ISPF? If the former, you want a binary transfer, end of story. 
If the latter, you don't really want to keep it intact, you want to translate Unicode -- and you will need to know which flavor of Unicode encoding (not what country) -- to EBCDIC, which is what ISPF and most COBOL programs expect. Comprende? Charles -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Scott Ford Sent: Thursday, January 09, 2014 4:36 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Subject Unicode All: I have a fundamental question on Unicode, or more of how it works. I am confused about the following scenario: PC ( data using a foreign language Unicode page, like French ) going to z/OS and being kept intact. Names and address type data. As the application do I have to query the incoming data and find out what the Unicode CECP is then translate to the desired ? or how does it work ? -- Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive sas...@sas.com (919) 531-5637 Cary, NC 27513 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Subject Unicode
Charles Mills writes: You could use 16 bits for every character, with some sort of cleverness that yielded two 16-bit words when you had a code point bigger than 65535 (actually somewhat less due to how the cleverness works). That is called UTF-16. Pretty good but still not very efficient. In Japan and China, to pick a couple examples, UTF-16 is rather efficient. There are also far worse inefficiencies than using 16 bits to store each Latin character. In short, I wouldn't get *too* hung up on this point, especially as the complete lifecycle costs of storage continue to fall. For example, if you're designing applications and information systems for a global audience (or potentially global audience), it could be a perfectly reasonable decision to standardize on UTF-16 in favor of potential reductions in testing (for example). I think this is exactly what SAP did around the time they introduced their ECC releases, for instance. Somehow I'm reminded of the save two characters impulse which then caused a lot of angst in preparing for Y2K. :-) If there's a reasonable argument for spending 16 bits -- and sometimes there is -- by all means, spend them. This isn't 1974 or even 1994. The vast majority of the world's data are not codepoint-encoded alphanumerics anyway. Timothy Sipples GMU VCT Architect Executive (Based in Singapore) E-Mail: sipp...@sg.ibm.com -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
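Sipples's efficiency point is easy to quantify for BMP CJK text, where UTF-16 needs 2 bytes per character and UTF-8 needs 3. A minimal Python check (the sentence is just an illustration):

```python
# For CJK text in the BMP, UTF-16 is the more compact encoding:
# 2 bytes per character versus 3 in UTF-8.
jp = "東京は日本の首都です"          # 10 kana/kanji characters
assert len(jp.encode("utf-16-le")) == 20
assert len(jp.encode("utf-8")) == 30
```

That 50% overhead is roughly the magnitude John Gilmore reported earlier in the thread for his mixed-language document.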