Re: Suggested magic for "a" .. "b"

Aaron Sherman Tue, 20 Jul 2010 22:28:37 -0700

OK, there's a lot here and my head is swimming, so let me re-consolidate and
re-state (BTW: thanks Jon, you've really helped me understand, here).

1) The spec is somewhat vague, but the proposal that I made for single
characters is not an unreasonable interpretation of what's there. Could we
therefore adopt the script / major category / minor category triplet as the
core tool that .succ uses for single, non-combining, non-modifying, valid
characters?

2) The spec doesn't put this information anywhere near the definition of the
range operator. Perhaps we can make a note? This was a source of confusion
for me.

3) It seems that there are two competing multi-character approaches, and both
seem somewhat valid. Should we use a pragma to toggle between behaviors A and
B?

A: "aa" .. "bb" contains "az"
B: "aa" .. "bb" contains ONLY "aa", "ab", "ba" and "bb"
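
To make the difference between A and B concrete, here is a rough Python sketch (not Perl 6, and not anything the spec defines). The names succ_lower, range_a and range_b are made up for illustration, and the sketch assumes strings of lowercase ASCII letters only.

```python
from itertools import product

def succ_lower(s):
    """Odometer-style successor over a..z, carrying to the left on 'z'."""
    chars = list(s)
    for i in reversed(range(len(chars))):
        if chars[i] != "z":
            chars[i] = chr(ord(chars[i]) + 1)
            return "".join(chars)
        chars[i] = "a"  # wrap this position and carry into the next one
    return "a" + "".join(chars)  # every position wrapped, so grow by one

def range_a(lo, hi):
    """Behavior A: keep taking the successor until hi is reached."""
    out = [lo]
    # The length check only guards against looping forever if hi is unreachable.
    while out[-1] != hi and len(out[-1]) <= len(hi):
        out.append(succ_lower(out[-1]))
    return out

def range_b(lo, hi):
    """Behavior B: each position ranges independently between its endpoints."""
    columns = [[chr(c) for c in range(ord(l), ord(h) + 1)]
               for l, h in zip(lo, hi)]
    return ["".join(p) for p in product(*columns)]

print(range_a("aa", "bb"))  # runs through "az" on the way to "ba"
print(range_b("aa", "bb"))  # ['aa', 'ab', 'ba', 'bb']
```

Under A the range is defined by repeated succession, so it has 28 elements; under B it is the cross product of the per-position ranges, so it has 4.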

4) About the ranges I gave as examples, you asked:

"Which codepoint is invalid, and why?"

There's just an undefined codepoint smack in the middle of the Greek
uppercase letters (U+03A2). I'm sure the Unicode specs have a rationale for
that somewhere, but my guess is that there's some thousand-year-old debate
about the Greek alphabet behind it.

"In both of these cases, what do you think it should produce?"

I actually gave that answer a bit later on. I think that "Ā" .. "Ē" should
produce ĀĂĄĆĈĊČĎĐĒ and "オ" .. "ヺ" should produce
オカガキギクグケゲコゴサザシジスズセゼソゾタダチヂツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモヤユヨラリルレロワヰヱヲンヴヷヸヹヺ
which are all of the Katakana syllabic characters.

"I also have to wonder how or if "0" ... "z" ought to be resolved. If
you're thinking in terms of the alphabet or digits, this is
nonsensical"

Well, since you agreed with my statement about the properties checking, it
would be 0 through 9 and then a through z because 0 through 9 are Latin
numbers, matching the LHS's properties and a through z are lowercase Latin
letters, matching the RHS's properties.

For reference, this is the relevant section of the spec:

Character positions are incremented within their natural range for any
Unicode range that is deemed to represent the digits 0..9 or that is deemed
to be a complete cyclical alphabet for (one case of) a (Unicode) script.
Only scripts that represent their alphabet in codepoints that form a cycle
independent of other alphabets may be so used. (This specification defers to
the users of such a script for determining the proper cycle of letters.) We
arbitrarily define the ASCII alphabet not to intersect with other scripts
that make use of characters in that range, but alphabets that intersperse
ASCII letters are not allowed.

I'm not sure that all of that tracks with the Unicode standard's use of some
of the terms, but based on what we've discussed, perhaps we could get more
specific there:

Character positions are incremented within their Unicode Script, but only in
keeping with their General Category property. Thus C<"A"++> yields C<"B">,
which is the next codepoint, but C<"Ă"++> yields C<"Ą"> even though "ă"
falls between the two when incrementing codepoints. Should this prove
problematic for any specific Unicode Script which requires special handling
(e.g. because a "letter" really isn't used as a letter at all), such special
handling may be applied, but the above is the general rule.
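
As a sanity check on that wording, here is a small Python sketch of the single-character rule. It is an approximation under stated assumptions: Python's unicodedata module exposes the General Category but not the Script property, so the first word of the character name ("LATIN", "GREEK", ...) is used as a crude stand-in for the script. The helper names props and succ_char are hypothetical.

```python
import unicodedata

def props(ch):
    """(script stand-in, General Category) for one character.

    Assumption: the first word of the Unicode character name is used in
    place of the real Script property, which unicodedata does not provide.
    Unassigned codepoints have no name and get an empty stand-in.
    """
    name = unicodedata.name(ch, "")
    return (name.split(" ")[0], unicodedata.category(ch))

def succ_char(ch):
    """Next codepoint after ch whose properties match those of ch."""
    target = props(ch)
    for cp in range(ord(ch) + 1, 0x110000):
        if props(chr(cp)) == target:
            return chr(cp)
    return None  # nothing later in the codespace matches

print(succ_char("A"))       # B
print(succ_char("\u0102"))  # Ą (U+0104), skipping the lowercase ă (U+0103)
print(succ_char("\u03a1"))  # Σ (U+03A3), skipping the unassigned U+03A2
```

Note that this sketch has no notion of a cycle, so it just keeps scanning upward; whether "Z" should wrap, stop, or carry is a separate question.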

and then in the section on ranges:

As discussed previously, incrementing a character (which is to say, invoking
C<.succ>) seeks the next codepoint with the same Unicode Script and General
Category properties (major and minor category to be specific). For ranges,
succession is the same if C<.min> and C<.max> have the same properties, but if
they do not, then every codepoint is considered which is greater than
C<.min> and smaller than C<.max> and which agrees with either the properties
of C<.min> I<or> the properties of C<.max>.
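
The range rule can be sketched the same way in Python. Again this is only an illustration: props and char_range are hypothetical names, and the first word of the character name stands in for the Unicode Script property because unicodedata does not expose it.

```python
import unicodedata

def props(ch):
    """(script stand-in, General Category) for one character.

    Assumption: the first word of the character name approximates the
    Script property; unassigned codepoints get an empty stand-in.
    """
    name = unicodedata.name(ch, "")
    return (name.split(" ")[0], unicodedata.category(ch))

def char_range(lo, hi):
    """Codepoints from lo to hi whose properties match those of lo or hi."""
    wanted = {props(lo), props(hi)}
    return [chr(cp) for cp in range(ord(lo), ord(hi) + 1)
            if props(chr(cp)) in wanted]

print("".join(char_range("0", "z")))            # 0123456789abcdefghijklmnopqrstuvwxyz
print("".join(char_range("\u0100", "\u0112")))  # ĀĂĄĆĈĊČĎĐĒ
```

One caveat: for "オ" .. "ヺ" this sketch also returns the small kana such as ッ, because they have the same General Category (Lo) as the full-size letters. Getting exactly the list I gave above would need an additional rule on top of the script and category match.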
