We are making progress toward supporting Unicode in gcobol. Along the way
we encountered a strange (to us) feature while converting EBCDIC
(CP1140) to ... anything. I don't understand the rationale behind the
implementation.
COBOL has a notion of "high-value", which is guaranteed to be the
"highest" value in the character set in use. IBM's COBOL reference
manual states:
For alphanumeric data with the EBCDIC collating sequence,
[HIGH-VALUE] is X'FF'.
Emphatically, high-value is *not* like EOF in C: it is part of the
character set, not a value outside it.
In GNU iconv, the value 0xFF does not convert to the corresponding
Unicode code point U+00FF:

    For UTF-8 to CP1140, U+00FF becomes 0xDF.
    For CP1140 to UTF-8, 0xFF becomes U+009F.
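
Here is a small test program that reproduces what we observe. It is
only a sketch, and it assumes the local iconv(3) accepts the encoding
names "CP1140" and "UTF-8" as spelled here:

#include <iconv.h>
#include <stdio.h>

/* Convert 'inlen' bytes of 'in' from encoding 'from' to encoding 'to'
   and print the resulting bytes in hex. */
static void convert (const char *from, const char *to,
                     const char *in, size_t inlen)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1) { perror ("iconv_open"); return; }

  char out[16];
  char *inp = (char *) in, *outp = out;
  size_t inleft = inlen, outleft = sizeof out;
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    perror ("iconv");

  printf ("%-6s -> %-6s:", from, to);
  for (char *p = out; p < outp; p++)
    printf (" %02X", (unsigned char) *p);
  putchar ('\n');
  iconv_close (cd);
}

int main (void)
{
  convert ("CP1140", "UTF-8", "\xFF", 1);      /* prints C2 9F, i.e. U+009F */
  convert ("UTF-8", "CP1140", "\xC3\xBF", 2);  /* U+00FF; prints DF */
  return 0;
}
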
In libiconv, this shows up in the last entries of the tables in
lib/ebcdic1140.h. The end of ebcdic1140_2uni has:
/* 0xf0 */
0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
0x0038, 0x0039, 0x00b3, 0x00db, 0x00dc, 0x00d9, 0x00da, 0x009f,
and the end of the reverse table (Unicode to EBCDIC) has:
0x8c, 0x49, 0xcd, 0xce, 0xcb, 0xcf, 0xcc, 0xe1, /* 0xf0-0xf7 */
0x70, 0xdd, 0xde, 0xdb, 0xdc, 0x8d, 0x8e, 0xdf, /* 0xf8-0xff */
Is there a documented reason 0xDF was chosen to work this way, and a
round-trip mapping to 0xFF rejected? Given IBM's statement, to these
innocent eyes it looks like a bug.
--jkl
On Tue, 07 Oct 2025 13:15:44 +0200
Bruno Haible <[email protected]> wrote:
> James K. Lowden wrote:
> > > 1. "National" support. COBOL programs define the runtime
> > > encoding and collation of each string (sometimes implicitly).
> > > COBOL defines two encodings: "alphanumeric" and "national".
> > > Every alphanumeric (and national) variable and literal has a
> > > defined runtime encoding that is distinct from the compile-time
> > > and runtime locale, and from the encoding of the source code.
> > > This means
> > >
> > > MOVE 'foo' TO FOO.
> > >
> > > may involve iconv(3) and
> > >
> > > IF 'foo' = FOO
> > >
> > > is defined as true/false depending on the *characters*
> > > represented, not their encoding. That 'foo' could be CP1140
> > > (single-byte EBCDIC) and FOO could be UTF-16.
> > > ...
> > > Conversion is a solved problem. Comparison is not.
>
> Comparison consists of two steps:
> 1) Convert both operands to Unicode. (Can be UTF-8, UTF-16, or
> UTF-32, which one does not matter.)
> 2) If a "closed world" assumption is valid:
> Compare the two Unicode strings.
> Otherwise:
> Convert the two Unicode strings to normalization form NFD, and
> compare the results.
>
> By "closed world" I mean: Unicode text exchanged between programs
> is typically assumed to be in Unicode normalization form NFC. See
> https://www.unicode.org/faq/normalization.html#2 . If this assumption
> holds, you don't need the normalization step above. Whereas if it
> does not hold, for example, because the program can read arbitrary
> text files, you need this normalization step.
>
> Paul Koning wrote:
> > Unicode comparison is addressed by the "stringprep" library.
>
> Careful: "stringprep" does extra steps, which drop characters. See
> https://datatracker.ietf.org/doc/html/rfc3454#section-3
>
> > > 2) a limited amount
> > > of Unicode evaluation is available in (IIRC) gnulib
>
> Correct. The comparison without normalization is available in
> libunistring as functions u8_cmp, u16_cmp, u32_cmp
> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-Unicode-strings.html
> or u8_strcmp, u16_strcmp, u32_strcmp:
> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-NUL-terminated-Unicode-strings.html
> Whereas the comparison with normalization is available as
> functions u8_normcmp, u16_normcmp, u32_normcmp:
> https://www.gnu.org/software/libunistring/manual/html_node/Normalizing-comparisons.html
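
For concreteness, a minimal sketch of the two comparisons described
above, using the libunistring functions named there. It assumes both
operands have already been converted to UTF-8 (e.g. with iconv(3));
the test data is "é" once precomposed (U+00E9) and once decomposed
(e + combining U+0301):

#include <stdint.h>
#include <stdio.h>
#include <unistr.h>     /* u8_cmp */
#include <uninorm.h>    /* u8_normcmp, UNINORM_NFD */

int main (void)
{
  static const uint8_t precomposed[] = { 0xC3, 0xA9 };        /* U+00E9 */
  static const uint8_t decomposed[]  = { 0x65, 0xCC, 0x81 };  /* e + U+0301 */

  /* "Closed world": compare the code units directly.  The lengths
     differ here, so the strings are simply not equal. */
  int equal = sizeof precomposed == sizeof decomposed
              && u8_cmp (precomposed, decomposed, sizeof precomposed) == 0;
  printf ("without normalization: %s\n", equal ? "equal" : "not equal");

  /* Otherwise: normalize both sides to NFD, then compare. */
  int result;
  if (u8_normcmp (precomposed, sizeof precomposed,
                  decomposed, sizeof decomposed,
                  UNINORM_NFD, &result) == 0)
    printf ("with NFD normalization: %s\n",
            result == 0 ? "equal" : "not equal");
  return 0;
}
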
>
> In Gnulib, each of these functions is available as a Gnulib module:
> https://www.gnu.org/software/gnulib/manual/html_node/How-to-use-libunistring.html
> https://www.gnu.org/software/gnulib/manual/html_node/_003cunistr_002eh_003e-modules.html
> https://www.gnu.org/software/gnulib/manual/html_node/_003cuninorm_002eh_003e-modules.html
>
> Jose Marchesi writes:
> > It would be good to avoid duplicating that code though.
>
> Especially as Unicode normalization is a rather complicated algorithm
> that includes data tables that change with every Unicode version.
> If you duplicate that code, upgrades to newer Unicode versions (that
> are released once a year) don't come for free. Whereas if you use
> libunistring or Gnulib, they do come for free.
>
> Bruno