There is an undocumented UTF stage to convert between the three
encodings. Once you have UTF-32, you might find it easier to
translate back to EBCDIC. As it is undocumented, testing is probably
incomplete. Your mileage may vary.
* This filter converts between UTF-8, UTF-16, and UTF-32.
*
* UTF [FROM] { [MODIFIED] 8 | 16 | 32 }
* [TO] { [MODIFIED] 8 | 16 | 32 }
* [REPORT]
*
* See http://www.unicode.org/versions/Unicode5.2.0/ for the
* current standard as of this writing. Resist the temptation
* to read RFC 3629 (which obsoletes RFC 2279); this is a
* somewhat confused derivative work. And stay away from
* Wikipedia except for the diagrams.
* In Modified UTF-8, U+0000 is encoded as a two-byte sequence
* and the byte x'00' never appears in encoded data (thus
* allowing normal null-terminated string semantics). This is the
* encoding used by Java for character data.
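To make the U+0000 special case concrete, here is a minimal sketch of the Modified UTF-8 encoding rule; the function name is my own, and a full implementation would also have to encode supplementary characters as CESU-8-style surrogate pairs, which is omitted here:

```python
def encode_modified_utf8(s: str) -> bytes:
    """Encode a string so the result never contains a x'00' byte.

    U+0000 becomes the (deliberately overlong) two-byte sequence
    C0 80; everything else in the BMP is standard UTF-8.
    Supplementary-plane handling is omitted from this sketch.
    """
    out = bytearray()
    for ch in s:
        if ord(ch) == 0:
            out += b"\xC0\x80"        # two-byte form of U+0000
        else:
            out += ch.encode("utf-8")  # ordinary UTF-8 otherwise
    return bytes(out)

print(encode_modified_utf8("A\x00B").hex())  # 41c08042
```

A standard UTF-8 decoder must reject C0 80 as an overlong sequence, which is why this encoding is a distinct variant rather than plain UTF-8.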
*
* Input is always validated as follows. When REPORT is
* specified, an error message is issued and processing stops
* on error; otherwise U+FFFD is substituted for the input code
* point.
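The two behaviours described above map directly onto Python's codec error handlers, which makes for an easy way to experiment with them; this is an illustration of the same policy, not the stage's own code:

```python
# 0xFF can never begin a UTF-8 sequence, so this input is invalid.
bad = b"\x41\xff\x42"

# Default behaviour of the stage: substitute U+FFFD and continue.
print(bad.decode("utf-8", errors="replace"))  # A<U+FFFD>B

# REPORT behaviour: diagnose the error and stop processing.
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid byte at offset", e.start)  # invalid byte at offset 1
```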
*
* UTF-8:
*
* The first byte of a code point must not contain b'10xxxxxx'
* or b'11111xxx'; subsequent bytes, if any, must contain
* b'10xxxxxx'; the code points reserved for surrogates must
* not be present; and the largest code point allowed is
* U+10FFFF, which is representable in four bytes.
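The lead-byte rules above can be sketched as a small classifier; this is my own illustration under the stated rules, and full validation would additionally reject overlong sequences, surrogate code points, and values above U+10FFFF once the trailing bytes are consumed:

```python
def utf8_length(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte; 0 means invalid."""
    if lead < 0x80:
        return 1   # b'0xxxxxxx' - ASCII
    if lead < 0xC0:
        return 0   # b'10xxxxxx' - continuation byte, invalid as lead
    if lead < 0xE0:
        return 2   # b'110xxxxx'
    if lead < 0xF0:
        return 3   # b'1110xxxx'
    if lead < 0xF8:
        return 4   # b'11110xxx' - longest valid form (up to U+10FFFF)
    return 0       # b'11111xxx' - never valid

print(utf8_length(0xF4))  # 4
```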
*
* UTF-16:
*
* All 16-bit numbers are allowed, except a low surrogate on its
* own and a high surrogate not followed by a low surrogate.
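The surrogate-pairing rule can be sketched as a decoder over 16-bit units; the function is my own illustration of the rule (raising on error, i.e. the REPORT behaviour):

```python
def decode_utf16_units(units):
    """Yield code points from a sequence of 16-bit units.

    A high surrogate (D800-DBFF) must be followed by a low
    surrogate (DC00-DFFF); a low surrogate on its own is invalid.
    """
    it = iter(units)
    for u in it:
        if 0xD800 <= u <= 0xDBFF:          # high surrogate
            low = next(it, None)
            if low is None or not (0xDC00 <= low <= 0xDFFF):
                raise ValueError("high surrogate not followed by low")
            yield 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= u <= 0xDFFF:        # lone low surrogate
            raise ValueError("low surrogate on its own")
        else:
            yield u                         # ordinary BMP code point

print(hex(next(decode_utf16_units([0xD800, 0xDC00]))))  # 0x10000
```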
*
* UTF-32:
*
* Any number less than x'110000' is valid, except for the
* surrogates (x'D800' through x'DFFF').
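The UTF-32 rule reduces to a one-line check; again a sketch of the rule as stated, not the stage's code:

```python
def valid_utf32(cp: int) -> bool:
    """Valid iff below x'110000' and not in the surrogate range."""
    return 0 <= cp < 0x110000 and not (0xD800 <= cp <= 0xDFFF)

print(valid_utf32(0x10FFFF), valid_utf32(0xD800))  # True False
```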
On 19 November 2012 19:50, Larson, John E. <[email protected]> wrote: