Re: strcasecmp() comparing punctuation in ASCII?

John McKown Fri, 02 Jun 2017 05:43:19 -0700

On Thu, Jun 1, 2017 at 10:32 PM, Paul Gilmartin <
[email protected]> wrote:

> On Thu, 1 Jun 2017 18:23:09 -0400, Thomas David Rivers wrote:
> >>
> >I find that very odd...
> >
> >strcasecmp() is obliged to convert all upper-case letters into lower-case
> >for the comparison.
> >
> Wouldn't it be a ﬁasco if it eﬀectively waﬄed on ligatures?
>

Hum, I can't see why ligatures would be of any concern. Assuming that I
understand them, they are just the result of very tight kerning of two
separate letters, e.g. "tucking" the "i" under the "roof" of the letter
"F". In memory this is still "Fi": two separate runes. ("Rune" is Go speak;
Go distinguishes a "character" from a "rune", i.e. a Unicode code point.
Ref: https://blog.golang.org/strings)
[quote]
...

Code points, characters, and runes

We've been very careful so far in how we use the words "byte" and
"character". That's partly because strings hold bytes, and partly because
the idea of "character" is a little hard to define. The Unicode standard
uses the term "code point" to refer to the item represented by a single
value. The code point U+2318, with hexadecimal value 2318, represents the
symbol ⌘. (For lots more information about that code point, see its Unicode
page.)

To pick a more prosaic example, the Unicode code point U+0061 is the lower
case Latin letter 'A': a.

But what about the lower case grave-accented letter 'A', à? That's a
character, and it's also a code point (U+00E0), but it has other
representations. For example we can use the "combining" grave accent code
point, U+0300, and attach it to the lower case letter a, U+0061, to create
the same character à. In general, a character may be represented by a
number of different sequences of code points, and therefore different
sequences of UTF-8 bytes.

The concept of character in computing is therefore ambiguous, or at least
confusing, so we use it with care. To make things dependable, there are
normalization techniques that guarantee that a given character is always
represented by the same code points, but that subject takes us too far off
the topic for now. A later blog post will explain how the Go libraries
address normalization.

"Code point" is a bit of a mouthful, so Go introduces a shorter term for
the concept: rune. The term appears in the libraries and source code, and
means exactly the same as "code point", with one interesting addition.

[quote/]

>
> -- gil
>

-- 
Windows. A funny name for an operating system that doesn't let you see
anything.

Maranatha! <><
John McKown

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
