Re: strcasecmp() comparing punctuation in ASCII?

Charles Mills Fri, 02 Jun 2017 06:08:14 -0700

And of course on the mainframe we get in the habit of using character and byte 
nearly interchangeably. The C language does not help: it uses char to mean an 
8-bit integer.

Charles

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf 
Of John McKown
Sent: Friday, June 2, 2017 5:43 AM
To: [email protected]
Subject: Re: strcasecmp() comparing punctuation in ASCII?

On Thu, Jun 1, 2017 at 10:32 PM, Paul Gilmartin < 
[email protected]> wrote:

> On Thu, 1 Jun 2017 18:23:09 -0400, Thomas David Rivers wrote:
> >>
> >I find that very odd...
> >
> >strcasecmp() is obliged to convert all upper-case letters into 
> >lower-case for the comparison.
> >
> Wouldn't it be a ﬁasco if it eﬀectively waﬄed on ligatures?
>

Hum, I can't see where ligatures are of any concern. Assuming that I understand 
them, they are just a result of very tight kerning of two separate letters. 
E.g. "tucking" the "i" under the "roof" of the letter "F". In memory this is 
still "Fi" - two separate runes (in Go speak - they distinguish "character" 
versus "rune" or "UNICODE code point". ref:
https://blog.golang.org/strings)
[quote]
...

Code points, characters, and runes

We've been very careful so far in how we use the words "byte" and "character". 
That's partly because strings hold bytes, and partly because the idea of 
"character" is a little hard to define. The Unicode standard uses the term 
"code point" to refer to the item represented by a single value. The code point 
U+2318, with hexadecimal value 2318, represents the symbol ⌘. (For lots more 
information about that code point, see its Unicode page.)
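
To make the code point versus byte distinction concrete, here is a small Go sketch that prints the UTF-8 bytes behind U+2318. The helper name utf8Hex is an assumption introduced for this illustration; only standard library calls are used.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf8Hex (hypothetical helper) returns the UTF-8 encoding of r
// as space-separated hexadecimal bytes.
func utf8Hex(r rune) string {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return fmt.Sprintf("% x", buf[:n])
}

func main() {
	const sym = '\u2318' // the symbol ⌘
	// One code point, three bytes in UTF-8.
	fmt.Printf("%U -> %s\n", sym, utf8Hex(sym)) // U+2318 -> e2 8c 98
}
```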

To pick a more prosaic example, the Unicode code point U+0061 is the lower case 
Latin letter 'A': a.

But what about the lower case grave-accented letter 'A', à? That's a character, 
and it's also a code point (U+00E0), but it has other representations. For 
example we can use the "combining" grave accent code point, U+0300, and attach 
it to the lower case letter a, U+0061, to create the same character à. In 
general, a character may be represented by a number of different sequences of 
code points, and therefore different sequences of UTF-8 bytes.

The concept of character in computing is therefore ambiguous, or at least 
confusing, so we use it with care. To make things dependable, there are 
normalization techniques that guarantee that a given character is always 
represented by the same code points, but that subject takes us too far off the 
topic for now. A later blog post will explain how the Go libraries address 
normalization.

"Code point" is a bit of a mouthful, so Go introduces a shorter term for the 
concept: rune. The term appears in the libraries and source code, and means 
exactly the same as "code point", with one interesting addition.

For IBM-MAIN subscribe / signoff / archive access instructions, send email to 
[email protected] with the message: INFO IBM-MAIN
