We are making progress toward supporting Unicode in gcobol.  Along the
way we encountered a feature that is strange (to us) when converting
EBCDIC (CP1140) to ... anything.  I don't understand the rationale
behind the implementation.

COBOL has a notion of "high-value", which is guaranteed to be the
"highest" value in a character set.  The reference manual for COBOL
from IBM states:

        For alphanumeric data with the EBCDIC collating sequence, 
        [HIGH-VALUE] is X'FF'.

Emphatically, high-value is *not* like EOF in C.  It's part of the
character set, not a value outside it.

In GNU iconv, the value 0xFF does not map to the same code point in
Unicode:

        For UTF-8 to CP1140, 0xFF becomes 0xDF
        For CP1140 to UTF-8, 0xFF becomes 0x9F

In libiconv, we can see this in the last entries of the tables in
lib/ebcdic1140.h.  The end of ebcdic1140_2uni has:

  /* 0xf0 */
  0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
  0x0038, 0x0039, 0x00b3, 0x00db, 0x00dc, 0x00d9, 0x00da, 0x009f,

and the end of the reverse (Unicode-to-EBCDIC) table:

  0x8c, 0x49, 0xcd, 0xce, 0xcb, 0xcf, 0xcc, 0xe1, /* 0xf0-0xf7 */
  0x70, 0xdd, 0xde, 0xdb, 0xdc, 0x8d, 0x8e, 0xdf, /* 0xf8-0xff */

Is there a documented reason 0xDF was chosen to work this way, and why
0xFF was rejected?  Given IBM's statement, to these innocent eyes it
looks like a bug.

--jkl



On Tue, 07 Oct 2025 13:15:44 +0200
Bruno Haible <[email protected]> wrote:

> James K. Lowden wrote:
> > > 1.  "National" support.  COBOL programs define the runtime
> > > encoding and collation of each string (sometimes implicitly).
> > > COBOL defines two encodings: "alphanumeric" and "national".
> > > Every alphanumeric (and national) variable and literal has a
> > > defined runtime encoding that is distinct from the compile-time
> > > and runtime locale, and from the encoding of the source code.
> > > This means
> > >
> > >   MOVE 'foo' TO FOO.
> > >
> > > may involve iconv(3) and 
> > >
> > >   IF 'foo' = FOO
> > >
> > > is defined as true/false depending on the *characters*
> > > represented, not their encoding.  That 'foo' could be CP1140
> > > (single-byte EBCDIC) and FOO could be UTF-16.  
> > > ...
> > > Conversion is a solved problem.  Comparison is not.
> 
> Comparison consists of two steps:
>   1) Convert both operands to Unicode. (Can be UTF-8, UTF-16, or
> UTF-32, which one does not matter.)
>   2) If a "closed world" assumption is valid:
>        Compare the two Unicode strings.
>      Otherwise:
>        Convert the two Unicode strings to normalization form NFD, and
>        compare the results.
> 
> By "closed world" I mean: Unicode text exchanged between programs
> is typically assumed to be in Unicode normalization form NFC. See
> https://www.unicode.org/faq/normalization.html#2 . If this assumption
> holds, you don't need the normalization step above. Whereas if it
> does not hold, for example, because the program can read arbitrary
> text files, you need this normalization step.
> 
> Paul Koning wrote:
> > Unicode comparison is addressed by the "stringprep" library.
> 
> Careful: "stringprep" does extra steps, which drop characters. See
> https://datatracker.ietf.org/doc/html/rfc3454#section-3
> 
> > > 2) a limited amount
> > > of Unicode evaluation is available in (IIRC) gnulib
> 
> Correct. The comparison without normalization is available in
> libunistring as functions u8_cmp, u16_cmp, u32_cmp
> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-Unicode-strings.html
> or u8_strcmp, u16_strcmp, u32_strcmp:
> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-NUL-terminated-Unicode-strings.html
> Whereas the comparison with normalization is available as
> functions u8_normcmp, u16_normcmp, u32_normcmp:
> https://www.gnu.org/software/libunistring/manual/html_node/Normalizing-comparisons.html
> 
> In Gnulib, each of these functions is available as a Gnulib module:
> https://www.gnu.org/software/gnulib/manual/html_node/How-to-use-libunistring.html
> https://www.gnu.org/software/gnulib/manual/html_node/_003cunistr_002eh_003e-modules.html
> https://www.gnu.org/software/gnulib/manual/html_node/_003cuninorm_002eh_003e-modules.html
> 
> Jose Marchesi writes:
> > It would be good to avoid duplicating that code though.
> 
> Especially as Unicode normalization is a rather complicated algorithm,
> that includes data tables that change with every Unicode version.
> If you duplicate that code, upgrades to newer Unicode versions (that
> are released once a year) don't come for free. Whereas if you use
> libunistring or Gnulib, they do come for free.
> 
> Bruno
> 
> 
> 
