There is an undocumented UTF stage to convert between the three
encodings.  Once you have UTF-32, you might find it easier to
translate back to EBCDIC.  As it is undocumented, testing is probably
incomplete.  Ymmv.

* This filter converts between UTF-8, UTF-16, and UTF-32.
*
*  UTF         [FROM] { [MODIFIED] 8 | 16 | 32 }
*              [TO]   { [MODIFIED] 8 | 16 | 32 }
*              [REPORT]
*
* See  http://www.unicode.org/versions/Unicode5.2.0/  for  the
* current standard as of  this writing.  Resist the temptation
* to  read RFC  3629 (which  obsoletes  RFC 2279);  this is  a
* somewhat confused  derivative work.  And  stay away  from
* Wikipedia except for the diagrams.

* In modified UTF-8,  U+0000 is encoded as  a two-byte sequence
* and x'00'  is not a  valid encoded character  (thus allowing
* normal  null-terminated  string  semantics).   This  is  the
* encoding used by Java for character data.
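The modified encoding can be sketched in a few lines. This is an illustrative helper (the name `encode_modified_utf8` is mine, not part of the stage): U+0000 becomes the overlong pair x'C080', so x'00' never appears in the output.

```python
def encode_modified_utf8(s: str) -> bytes:
    # Hypothetical helper sketching Java-style modified UTF-8:
    # U+0000 is emitted as the two-byte sequence C0 80, so the
    # byte x'00' never occurs and C-style null termination works.
    out = bytearray()
    for ch in s:
        if ch == "\u0000":
            out += b"\xc0\x80"
        else:
            # Java modified UTF-8 also encodes supplementary
            # characters as encoded surrogate pairs (CESU-8);
            # that detail is omitted here for brevity.
            out += ch.encode("utf-8")
    return bytes(out)
```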
*
* Input  is  always  validated  as follows.   When  REPORT  is
* specified, an  error message is issued  and processing stops
* on error; otherwise U+FFFD is substituted for the input code
* point.
*
* UTF-8:
*
* The first byte of a code  point must not contain b'10xxxxxx'
* or  b'11111xxx';  subsequent  bytes, if  any,  must  contain
* b'10xxxxxx'; the  code points  reserved for  surrogates must
* not  be  present;  and  the largest  code  point  allowed  is
* U+10FFFF, which is representable in four bytes.
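The UTF-8 rules above can be sketched as a small decoder. This is my own illustration, not the stage's code; the `report` flag mimics the REPORT behavior described earlier (stop on error versus substitute U+FFFD), and overlong forms are deliberately not rejected here.

```python
def decode_utf8(data: bytes, report: bool = False) -> list[int]:
    # Sketch of the validation rules: the lead byte must not be
    # b'10xxxxxx' or b'11111xxx', continuation bytes must be
    # b'10xxxxxx', surrogates and code points above U+10FFFF are
    # invalid.  Overlong sequences are not detected in this sketch.
    out, i = [], 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            n, cp = 1, b0
        elif 0xC0 <= b0 < 0xE0:
            n, cp = 2, b0 & 0x1F
        elif 0xE0 <= b0 < 0xF0:
            n, cp = 3, b0 & 0x0F
        elif 0xF0 <= b0 < 0xF8:
            n, cp = 4, b0 & 0x07
        else:                          # b'10xxxxxx' or b'11111xxx' lead
            n, cp = 1, None
        if cp is not None:
            for b in data[i + 1:i + n]:
                if b & 0xC0 != 0x80:   # continuation must be b'10xxxxxx'
                    cp = None
                    break
                cp = (cp << 6) | (b & 0x3F)
            if len(data) - i < n:      # truncated sequence
                cp = None
        if cp is not None and (0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF):
            cp = None                  # surrogate or beyond U+10FFFF
        if cp is None:
            if report:
                raise ValueError(f"invalid UTF-8 at offset {i}")
            cp = 0xFFFD
        out.append(cp)
        i += n
    return out
```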
*
* UTF-16:
*
* All 16-bit numbers are allowed, except a low surrogate  on
* its own and a high surrogate not followed by a low
* surrogate.
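The pairing rule can be sketched over a list of 16-bit units (my own illustration; the `report` flag again mimics the REPORT behavior of stopping versus substituting U+FFFD):

```python
def decode_utf16_units(units: list[int], report: bool = False) -> list[int]:
    # Sketch of the UTF-16 rule: any unit is legal except a low
    # surrogate (DC00-DFFF) on its own, or a high surrogate
    # (D800-DBFF) not followed by a low surrogate.
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                    # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                lo = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
                i += 2
                continue
            cp = None                                # unpaired high
        elif 0xDC00 <= u <= 0xDFFF:
            cp = None                                # lone low surrogate
        else:
            cp = u
        if cp is None:
            if report:
                raise ValueError(f"invalid UTF-16 at unit {i}")
            cp = 0xFFFD
        out.append(cp)
        i += 1
    return out
```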
*
* UTF-32:
*
* Any  number less  than x'110000'  is valid,  except for  the
* surrogates (x'D800' through x'DFFF').
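The UTF-32 check is the simplest of the three; as a one-line sketch (function name is mine):

```python
def valid_utf32(cp: int) -> bool:
    # A UTF-32 value is valid if it is below x'110000' and is not
    # one of the surrogates (x'D800' through x'DFFF').
    return cp < 0x110000 and not (0xD800 <= cp <= 0xDFFF)
```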

On 19 November 2012 19:50, Larson, John E. <[email protected]> wrote:
