Jon Lang wrote:
I don't know enough about Unicode to suggest how to solve this. All I can
say is that my example above should never return a valid Range object unless
there is a way I can specify my own ordering and I use it.

That actually says something: it says that we may want to reconsider
the notion that all string values can be sorted.  You're suggesting
the possibility that "a" cmp "ส้" is, by default, undefined.

I think a general solution here is to accept that some types, strings especially, have more than one valid sort order, so operators and routines that sort should be customizable in some way, letting users pick the behaviour they want.

The customization could be applied at various levels, such as using an extra argument or trait for the operator/function that cares about ordering, or by using an extra attribute or trait for the types being sorted.
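
To make those two levels a bit more concrete, here is a rough sketch. It is written in Python rather than Perl 6, since the Perl 6 syntax for this is still only being imagined. Every name in it (in_range, OrderedStr, codepoint_key, caseless_key) is made up for illustration, and the case-insensitive ordering just stands in for "some other valid ordering".

```python
def codepoint_key(text):
    """Assumed default ordering: compare by Unicode code point sequence."""
    return [ord(ch) for ch in text]


def caseless_key(text):
    """A stand-in alternative ordering: ignore case differences."""
    return [ord(ch) for ch in text.casefold()]


# Level 1: the ordering is an extra argument to the operation itself,
# so it affects only this one test.
def in_range(value, low, high, key=codepoint_key):
    return key(low) <= key(value) <= key(high)


# Level 2: the ordering travels with the value, so it applies wherever
# the value is compared.
class OrderedStr:
    def __init__(self, text, key=codepoint_key):
        self.text = text
        self.key = key

    def __lt__(self, other):
        # sorted() only needs "<", so that is all this sketch defines.
        return self.key(self.text) < self.key(other.text)


print(in_range("B", "a", "c"))                    # False: "B" precedes "a" by code point
print(in_range("B", "a", "c", key=caseless_key))  # True once case is ignored

words = [OrderedStr(t, caseless_key) for t in ["b", "A", "C"]]
print([w.text for w in sorted(words)])            # ['A', 'b', 'C']
```

Either way the sorting routine itself stays generic; only the place where the ordering is attached differs.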

In fact, this whole issue is very close in concept to the situations where you need to do equality/identity tests.

With strings, identity tests can give different answers depending on whether you compare language-dependent or language-independent graphemes, and Perl 6 encodes that abstraction level as value metadata.
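
A small illustration of how the abstraction level changes the answer, again in Python and only as a sketch. It compares at the codepoint level versus after NFC normalization; normalization is just a rough approximation of a grapheme-level comparison here, not what Perl 6 does, and the function names are my own.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def same_codepoints(a, b):
    """Identity at the codepoint level: the sequences must match exactly."""
    return a == b


def same_normalized(a, b):
    """Identity after NFC normalization (a stand-in for grapheme-level tests)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_codepoints(composed, decomposed))  # False
print(same_normalized(composed, decomposed))  # True
```

The two strings render identically, yet whether they are "the same" depends entirely on which level the test runs at.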

If you want to be consistent, the behaviour of "cmp" has to carry over to all of the other order-sensitive operations, including any that work with intervals.

Some possible examples of customization:

  $foo ~~ $a..$b :QuuxNationality  # just affects this one test

  $bar = 'hello' :QuuxNationality  # applies anywhere the Str value is used

Declaring a Str subtype that carries its own ordering, or something like that, would be another option.

Of course, after all this, we still want some reasonable default. I suggest that for Str values that aren't nationality-specific, the default ordering be whatever generic ordering Unicode defines, which might be by codepoint. For Str values with nationality-specific grapheme abstractions, the default can be whatever ordering that nationality uses. These defaults apply except where users define some other order.

So then, "a" cmp "ส้" is always defined, but users can change the definition.
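
Here is a sketch of that defaulting rule, in Python and with invented names throughout: ORDERINGS, quux_key and the three-way cmp function are all hypothetical, and the "Quux" rule (lowercase ASCII letters sort in reverse, ahead of everything else) is made up purely so there is something to override with. Codepoint order is assumed as the generic default.

```python
def codepoint_key(text):
    """Assumed generic default: order by Unicode code point."""
    return [ord(ch) for ch in text]


def quux_key(text):
    """Invented rule for the made-up Quux nationality:
    lowercase ASCII letters sort in reverse, ahead of everything else."""
    return [-ord(ch) if "a" <= ch <= "z" else ord(ch) for ch in text]


# Hypothetical registry mapping a nationality to its ordering.
ORDERINGS = {"Quux": quux_key}


def cmp(a, b, nationality=None):
    """Three-way comparison: -1, 0 or 1. Falls back to codepoint order
    when no nationality is given or none is registered."""
    key = ORDERINGS.get(nationality, codepoint_key)
    ka, kb = key(a), key(b)
    return (ka > kb) - (ka < kb)


thai = "\u0e2a\u0e49"          # the "ส้" from the example

print(cmp("a", thai))          # -1: defined by default, U+0061 < U+0E2A
print(cmp("a", "b"))           # -1 under the default
print(cmp("a", "b", "Quux"))   # 1 once the Quux ordering is selected
```

The comparison never fails to produce an answer; choosing a nationality only changes which answer it produces.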

-- Darren Duncan
