Approaching this with the notion firmly in mind that infix:<..> is
supposed to be used for matching ranges while infix:<...> should be
used to generate series:

Aaron Sherman wrote:
> Walk with me a bit, and let's explore the concept of intuitive character
> ranges? This was my suggestion, which seems pretty basic to me:
>
> "x .. y", for all strings x and y, which are composed of a single, valid
> codepoint which is neither combining nor modifying, yields the range of all
> valid, non-combining/modifying codepoints between x and y, inclusive which
> share the Unicode script, general category major property and general
> category minor property of either x or y (lack of a minor property is a
> valid value).

This is indeed true for both range-matching and series-generation as
the spec is currently written.
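
To check that I'm reading the proposal correctly, here is a rough model of it. It is in Python rather than Perl 6, purely for illustration; char_range is a made-up name, and I have left out the script check because Python's standard unicodedata module exposes only the general category (major and minor together), not the script. Treat it as an approximation of the rule, not the rule itself.

```python
import unicodedata

def char_range(x, y):
    """Codepoints from x to y (inclusive) that share a general category
    with x or y.

    Assumption: the script test from the proposal is omitted, because
    unicodedata does not expose script properties.
    """
    allowed = {unicodedata.category(x), unicodedata.category(y)}
    return [
        chr(cp)
        for cp in range(ord(x), ord(y) + 1)
        if unicodedata.category(chr(cp)) in allowed
    ]

# Both endpoints are uppercase letters, so lowercase letters and
# unassigned codepoints that fall between them are skipped.
print("".join(char_range("Ā", "Ē")))
```

Under this reading, the result depends only on the endpoints' categories, so a reversed pair of endpoints simply yields nothing.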

> In general we have four problems with current specification and
> implementation on the Perl 6 and Perl 5 sides:
>
> 1) Perl 5 and Rakudo have a fundamental difference of opinion about what
> some ranges produce ("A" .. "z", "X" .. "T", etc) and yet we've never really
> articulated why we want that.
>
> 2) We deny that a range whose LHS is "larger" than its RHS makes sense, but
> we also don't provide an easy way to construct such ranges lazily otherwise.
> This would be annoying only, but then we have declared that ranges are the
> right way to construct basic loops (e.g. for (1..1e10).reverse -> $i {...}
> which is not lazy (blows up your machine) and feels awfully clunky next to
> for 1e10..1 -> $i {...} which would not blow up your machine, or even make
> it break a sweat, if it worked)

With ranges, we want C< when $LHS .. $RHS > to always mean C<< if
$LHS <= $_ <= $RHS >>.  If $RHS < $LHS, then the range being specified
is not valid.  In this context, it makes perfect sense to me why it
doesn't generate anything.
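
As a minimal sketch of that matching rule (Python used only to model the semantics; in_range is a hypothetical helper, not anything from the spec):

```python
def in_range(value, lhs, rhs):
    """Model of `when $LHS .. $RHS`: true exactly when lhs <= value <= rhs."""
    return lhs <= value <= rhs

print(in_range(5, 1, 10))   # endpoints in order: 5 lies between them
print(in_range(5, 10, 1))   # endpoints reversed: nothing can satisfy both tests
```

With the endpoints reversed, no value can be both at least 10 and at most 1, which is why the range matches nothing.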

With series, we want C< $LHS ... $RHS > to generate a list of items
starting with $LHS and ending with $RHS.  If $RHS > $LHS, we want it
to increment one step at a time; if $RHS < $LHS, we want it to
decrement one step at a time.
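
A sketch of the series behavior for integers, again in Python as a model only (series is a hypothetical name, and the step of exactly one is the assumption stated above):

```python
def series(lhs, rhs):
    """Model of `$LHS ... $RHS` for integers: walk from lhs to rhs,
    one step at a time, in whichever direction rhs lies."""
    step = 1 if rhs >= lhs else -1
    value = lhs
    while True:
        yield value
        if value == rhs:
            return
        value += step

print(list(series(1, 5)))
print(list(series(5, 1)))
```

Because this is a generator, a countdown from a very large number is produced lazily, one item per request, instead of being built and then reversed.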

So: 1) we want different behavior from the Range operator in Perl 6
vs. Perl 5 because we have completely re-envisioned the range
operator.  What we have replaced it with is fundamentally more
flexible, though not necessarily perfect.

> 3) We've never had a clear-cut goal in allowing string ranges (as opposed to
> character ranges, which Perl 5 and 6 both muddy a bit), so "intuitive"
> becomes sketchy at best past the first grapheme, and ever muddier when only
> considering codepoints (thus that wing of my proposal and current behavior
> are on much shakier ground, except in so far as it asserts that we might
> want to think about it more).

I think that one notion that we're dealing with here is the idea that
C<< $X < $X.succ >> for all strings.  This seems to be a rather
intuitive assumption to make; but it is apparently not an assumption
that Stringy.succ makes.  As I understand it, "Z".succ eqv "AA".  What
benefit do we gain from this behavior?  Is it the idea that eventually
this will iterate over every possible combination of capital letters?
If so, why is that a desirable goal?
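
To illustrate the concern, here is a small model of the behavior as I understand it, restricted to strings of capital letters. It is Python, not Perl 6; succ_growing is a made-up name, and Python's string comparison stands in for Perl 6 string ordering.

```python
def succ_growing(s):
    """Increment a string of capital letters, growing on overflow,
    so that "Z" becomes "AA" (my understanding of the current behavior)."""
    chars = list(s)
    i = len(chars) - 1
    while i >= 0:
        if chars[i] != "Z":
            chars[i] = chr(ord(chars[i]) + 1)
            return "".join(chars)
        chars[i] = "A"   # carry into the next position to the left
        i -= 1
    return "A" + "".join(chars)   # every position overflowed: grow by one

nxt = succ_growing("Z")
print(nxt, "Z" < nxt)   # the successor does not sort after its predecessor
```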


My own gut instinct would be to define the string iterator such that
it increments the final letter in the string until it gets to "Z";
then it resets that character to "A" and increments the next character
by one:

"ABE", "ABF", "ABG" ... "ABZ", "ACA", "ACB" ... "ZZZ"

This pattern ensures that each string in the series is less than its
successor.  It does not ensure that every
possible string between "ABE" and "ZZZ" will be represented; far from
it.  But then, 1...9 doesn't produce every number between 1 and 9; it
only produces integers.  Taken to an extreme: pi falls between 1 and
9; but no one in his right mind expects us to come up with a general
sequencing of numbers that increments from 1 to 9 with a guarantee
that it will hit pi before reaching 9.

Mind you, I know that the above is full of holes.  In particular, it
works well when you limit yourself to strings composed of capital
letters; do anything fancier than that, and it falls on its face.
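
For what it's worth, the fixed-width idea can be sketched as follows. This is Python used as a model, the names succ_fixed and fixed_series are hypothetical, and it deliberately handles only strings of capital letters, which is exactly the limitation admitted above.

```python
def succ_fixed(s):
    """Increment a string of capital letters without changing its length.

    Returns None once every position is "Z", i.e. the series has ended.
    """
    chars = list(s)
    for i in range(len(chars) - 1, -1, -1):
        if chars[i] != "Z":
            chars[i] = chr(ord(chars[i]) + 1)
            return "".join(chars)
        chars[i] = "A"   # reset this position and carry leftward
    return None

def fixed_series(start):
    """Yield start and its successors until the all-"Z" string is reached."""
    current = start
    while current is not None:
        yield current
        current = succ_fixed(current)

print(succ_fixed("ABZ"))
print(list(fixed_series("ZZX")))
```

Since the length never changes, every item compares less than the one after it, and the series terminates instead of wandering off into longer strings.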

> 4) Many ranges involving single characters on LHS and RHS result in null
> or infinite output, which is deeply non-intuitive to me, and I expect many
> others.

Again, the distinction between range-matching and series-generation
comes to the rescue.

> Solve those (and I tried in my suggestion) and I think you will be able to
> apply intuition to character ranges, but only in so far as a human being is
> likely to be able to intuit anything related to Unicode.

Of the points that you raise, #1, 2, and 4 are neatly solved already.
I'm unsure as to #3; so I'd recommend focusing some scrutiny on it.

> The current behaviour of the range operator is (if I recall correctly):
>> 1) if both sides are single characters, make a range by incrementing
>> codepoints
>>
>
> Sadly, you can't do that reasonably. Here are some examples of why, using
> only Latin and Greek as examples (not the most convoluted Unicode sections
> to be sure):

Bear in mind that I don't know much about Unicode; so please humor me:

>   - "Α" (capital Greek alpha, not Latin A) .. "Ω" would result in a range
>   that contains an invalid codepoint (rakudo: drops the invalid codepoint,
>   which you may have meant to imply, but I'm being pedantic because I want to
>   come to a specification, not just a sense of the right solution)

Which codepoint is invalid, and why?

>   - "Ā" .. "Ē" would be "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒ" which is really not what
>   you're likely to expect! (rakudo: Ā, infinitely repeating, which is an even
>   larger problem for Katakana, where "オ" .. "ヺ" seems a very intuitive way to
>   say "all Katakana non-cased letters" but fails because the range contains
>   both cased and uncased; Perl 5 just prints "オ", and I think it also sneers
>   at you)

In both of these cases, what do you think it should produce?

>   - "A" .. "z" comes out really odd because it contains punctuation (mind
>   you, your suggestion is saner than Rakudo's current behavior on "A" .. "z"
>   which is an infinite progression of capital-letter-only sequences of 1 or
>   more characters! Intuitive, it's not.)

I have to agree here, on all points.

I also have to wonder how or if "0" ... "z" ought to be resolved.  If
you're thinking in terms of the alphabet or digits, this is
nonsensical; if you're thinking in terms of alphanumerics, you might
get the equivalent of "0" ... "9", "A" ... "Z", "a" ... "z".  If
you're thinking in terms of raw characters, you should get "0" ...
"9", ":", ";", "<", "=", ">", "?", "@", "A" ... "Z", "[", "\", "]",
"^", "_", "`", "a" ... "z".

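The raw-character and alphanumeric readings of "0" ... "z" can be compared with a quick model. It is Python, limited to ASCII, and raw_series and alnum_series are names I made up for the illustration.

```python
def raw_series(lhs, rhs):
    """Every codepoint from lhs to rhs, inclusive (the raw-character reading)."""
    return [chr(cp) for cp in range(ord(lhs), ord(rhs) + 1)]

def alnum_series(lhs, rhs):
    """Only the ASCII digits and letters among those codepoints
    (the alphanumeric reading)."""
    return [c for c in raw_series(lhs, rhs) if c.isascii() and c.isalnum()]

print("".join(raw_series("0", "z")))
print("".join(alnum_series("0", "z")))
```

The two readings share endpoints and ordering; they differ only in whether the punctuation between the three runs is kept.
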
> PPS: Other unexpected results in Rakudo, all related to the behavior that
> Rakudo seems to have around ranges that it doesn't think are legitimate for
> ranges: it repeats the LHS infinitely:
>
>  "䷀" .. "䷿"  - expected: all hexagram characters; got: first character,
> infinitely repeating.
> "鐀" .. "鐅" - expected: all CJK Unified Ideographs between u+9400 and u+9405;
> got: first character, infinitely repeating.
> "٠" .. "٩" - expected: all Arabic-Indic digits zero through nine; got: first
> digit (zero) repeating (note: bidi may confuse display in this email)
> "א" .. "ת" - expected: all Hebrew letters; got: first character (א)
> repeating (note: bidi may confuse display in this email)
> "Ａ" .. "Ｅ" - expected: all full width, capital letters A through E; got:
> full width A repeating.

This does indeed seem to be a problem.  In all cases, though, I think
that the problem lies in how .succ is defined for strings, rather than
how the range and/or series operators work.

-- 
Jonathan "Dataweaver" Lang
