Re: strcasecmp() comparing punctuation in ASCII?

Kirk Wolf Fri, 02 Jun 2017 07:06:32 -0700

The IEEE Std 1003.1, 2004 Edition was mentioned, but please note that it
says this:


"In the POSIX locale, *strcasecmp*() and *strncasecmp*() shall behave as if
the strings had been converted to lowercase and then a byte comparison
performed. *The results are unspecified in other locales.*"

This is interesting in that it points to the next question:  what locale
are you running under?
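
One way to answer that from inside the program is to query the locale
without changing it. This is a minimal sketch using only standard C
setlocale(); the helper name current_collate_locale() is made up for
illustration, and the exact name string that z/OS returns is not something
this sketch assumes.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the name of the current LC_COLLATE locale
   into buf. Returns buf, or NULL if the name is unavailable or does not
   fit in buflen bytes. */
const char *current_collate_locale(char *buf, size_t buflen)
{
    /* Passing NULL queries the current setting without changing it. */
    const char *name = setlocale(LC_COLLATE, NULL);

    if (name == NULL || strlen(name) >= buflen)
        return NULL;
    strcpy(buf, name);
    return buf;
}
```

Printing the returned string at startup, before any other setlocale() call,
shows which locale strcasecmp() is actually running under.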

The strcasecmp() function in XLC/C++ is poorly documented.  It only says:
     "The strcasecmp() function is locale-sensitive".

In z/OS XLC/C++, the default if you are running POSIX(ON) is the "POSIX C"
locale.
If this is the case for you, then the above statement from the standard
would mean that:

    strcasecmp(a,b)  ==  strcmp(tolower(a), tolower(b))
         # assuming a tolower(char*) function based on tolower(char)
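
To make that comparison concrete, here is a rough sketch of the behavior
the standard describes for the POSIX locale: lowercase each byte with
tolower(), then compare bytes. The name lower_bytecmp() is hypothetical,
and this is not IBM's implementation, only a reference to compare against.

```c
#include <ctype.h>

/* Hypothetical reference: compares as if both strings had been converted
   to lowercase with tolower() and then compared byte by byte.
   Returns <0, 0 or >0 like strcmp(). */
int lower_bytecmp(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;; pa++, pb++) {
        int ca = tolower(*pa);
        int cb = tolower(*pb);

        if (ca != cb)
            return (ca > cb) - (ca < cb);  /* sign of the byte difference */
        if (ca == 0)
            return 0;                      /* both strings ended together */
    }
}
```

Comparing the sign of lower_bytecmp("Z", "0") with the sign of
strcasecmp("Z", "0") on the same system would show whether the library
follows the quoted wording. On an EBCDIC system the sketch should return a
negative value, since 'z' is 0xa9 and '0' is 0xf0.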

But you aren't seeing this.

So, either:

a) you aren't running with the POSIX C locale  (where the result of
strcasecmp() is unspecified by the standard).

b) you are running with the POSIX C locale, but IBM didn't follow the
standard.

According to the XLC/C++ doc:
" The POSIX C locale uses the ASCII collation sequence; the first 128 ASCII
characters are defined in the collation sequence, and the remaining EBCDIC
characters are at the end of the collating sequence."

Is this what you are seeing?  If so, then XLC/C++ strcasecmp() uses
LC_COLLATE for the POSIX C locale (and not byte comparison as specified by
the standard).   Or maybe locale "POSIX C" != "POSIX".  Who knows.

Note:  if you are using the uppercase of a word/phrase as a key, you might
consider saving the uppercase/lowercase key and then using strcmp() or
strcoll() to compare.    Or you could define your own collation sequence
via a translate table and then use the translated string as the key with
strcmp().
Using strcmp() for things like sort will probably be much faster anyway
since it will be inlined using the CLST instruction, whereas strcasecmp()
will be a function call.
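
A sketch of the translate-table idea, under stated assumptions: the names
xlate, init_xlate() and make_key() are invented for illustration, and the
table shown only folds lowercase to uppercase. A real table could assign
whatever sort weights you want, as long as weight 0 stays reserved for the
terminating NUL.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical translate table: maps each byte to its sort weight. */
static unsigned char xlate[256];

/* Start from an identity mapping, then fold lowercase onto uppercase. */
void init_xlate(void)
{
    for (int i = 0; i < 256; i++)
        xlate[i] = (unsigned char)i;
    for (int i = 0; i < 256; i++)
        if (islower(i))
            xlate[i] = (unsigned char)toupper(i);
}

/* Writes the translated key for src into dst (capacity dstlen bytes).
   Returns 0 on success, -1 if dst is too small. */
int make_key(char *dst, size_t dstlen, const char *src)
{
    size_t n = strlen(src);

    if (n >= dstlen)
        return -1;
    for (size_t i = 0; i < n; i++)
        dst[i] = (char)xlate[(unsigned char)src[i]];
    dst[n] = '\0';
    return 0;
}
```

Build the key once when the entry is stored, build it again for the user's
input, and use plain strcmp() on the two keys for both the sort and the
lookup, so both sides always agree on the ordering.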

More information on XLC/C++ locales:
https://www.ibm.com/support/knowledgecenter/SSLTBW_2.1.0/com.ibm.zos.v2r1.cbcpx01/cloc.htm


Kirk Wolf
Dovetailed Technologies
http://dovetail.com

On Fri, Jun 2, 2017 at 8:19 AM, Charles Mills <[email protected]> wrote:

> A couple of takeaways from all of this, at least for me:
>
> 1. Rather surprisingly, strcmp() and strcasecmp() do not return the same
> results *for strings that are not mixed case*! For example, for ("ABC",
> "123") strcmp() would return < but strcasecmp() would return > (untested).
> Consider the following bit of program logic, which would fail: Build a
> table of userids from some system source. Because they come from the
> system they are all uppercase, so they could be sorted using strcmp().
> When a user enters a userid, look it up in the table using a binary
> search. Since the user might enter the id in mixed case, search using
> strcasecmp(). The binary search would fail. (Yes, you could first
> uppercase the user input and use strcmp(). This is an illustrative
> example, not a "problem.")
>
> 2. One has to be very careful mixing strcasecmp() with roll-your-own
> compares such as if ( tolower(left[i]) != tolower(right[i]) ) ... I am
> checking my code for this issue.
>
> Charles
>
>
> -----Original Message-----
> From: IBM Mainframe Discussion List [mailto:[email protected]] On
> Behalf Of Charles Mills
> Sent: Thursday, June 1, 2017 4:45 PM
> To: [email protected]
> Subject: Re: strcasecmp() comparing punctuation in ASCII?
>
> > strcasecmp() is obliged to convert all upper-case letters into
> > lower-case for the comparison
>
> I don't think it is as simple-minded as that (no offense -- you're not
> simple-minded either <g>).
>
> I think @John McKown pretty much nailed it. It's an "abstract" compare that
> just happens (well, a little more than just happens) to largely conform to
> ASCII.
>
> I think the -37 and 122 confirm that it is not a simple "subtract one
> ASCII value from another." And yes, I picked 'Z' and '0' specifically
> because they order differently in ASCII versus EBCDIC. I've deleted my
> test code now, but an interesting case would be 'A' and 'b', which also
> differ in order between EBCDIC and ASCII. A guess would be that the
> result would be +1, because they should be one entry apart in the
> abstract table, unlike their code points, which are x'20' or x'40' apart.
> ('A' and 'a' of course compare equal -- that's the whole point, isn't it?)
>
> Yes, the results were astonishing. I assumed a bug in my code.
>
> My code is not only working correctly, it's superfluous! I do initial
> development and alpha test on Windows. I have some
> hard-coded-in-collating-order tables that I binary search. Most of them use
> all-alpha keys and so that order does not matter ASCII (Windows) versus
> EBCDIC (MVS). A few tables have mixed alpha, numerics and/or punctuation
> and so those I sort into collating sequence on start-up. Well, I needn't
> have bothered! As you indicate, lower_bound results are consistent between
> Windows and MVS.
>
> strcasecmp() is obliged to convert all upper-case letters into lower-case
> for the comparison.
>
> Lower-case 'z' in EBCDIC is 0xa9, '0' in EBCDIC is 0xf0.  0xA9 is less than
> 0xF0; so I would expect strcasecmp() to return a value less than zero.
>
> But - as you point out - in ASCII, 'z' is 0x7a, and '0' is 0x30.  0x7a is
> greater than 0x30, so on an ASCII platform I would expect this to return a
> positive value (note that 0x7a - 0x30 is 74 - not the 122 that IBM
> returned, although both are positive.)
>
> IBM must be mapping the values to some abstract code-points and then
> subtracting those...
>
> It _would_ mean that the strcasecmp() results are consistent between ASCII
> and EBCDIC environments... which might sometimes be a desirable
> characteristic - and other times not...
> For IBM-MAIN subscribe / signoff / archive access instructions, send email
> to [email protected] with the message: INFO IBM-MAIN
>
