Oh bother, I wrote this up last night, but forgot to send it. Here y'all go:
I've been testing ".." recently, and it seems, in Rakudo, to behave like Perl 5. That is, the magic auto-increment for "a" .. "z" works very wonkily given any range that isn't within some very strict definitions (identical Unicode general category, increasing, etc.). So the following:

"A" .. "z"

produces very odd results.

I'd like to suggest that we re-define this operator on strings as follows:

RESTRICTIONS:

First off, if either argument contains combining, modifying, undefined, reserved or other codepoints which either cannot be treated as a single, independent "character" or whose Unicode properties are not firmly established in the Unicode specification, then an exception is immediately raised. This must be done in order to ensure that each character index can be compared to each corresponding character index without the typical Unicode ambiguities.

Ligatures and other decomposable sequences are treated by their codepoint in the current encoding, only.

Treatment of strings whose encodings differ should be possible, as all comparisons are performed on codepoints.

If either argument is zero length, an exception is raised.

If either argument is *, then it is assumed to stand for the largest (RHS) or smallest (LHS) codepoint with the same Unicode general properties as the opposite side (for each character index, if the other value is a string of length > 1).

ALGORITHM:

If both arguments are strings of non-zero length, ".." will first determine which is the shorter. This length is the "significant length". Any characters after this length in the longer sequence are ignored (return value might be an unthrown exception in this case?)
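To make the restrictions and the "significant length" rule concrete, here is a rough Python sketch (Python only because its standard unicodedata module exposes the general category; this is not a proposed implementation). The names check_endpoint and significant_pairs are made up for illustration, and the exact set of rejected categories is my guess at what "combining, modifying, undefined, reserved" would map to. The * case is not handled.

```python
import unicodedata

# Assumption: this mapping from the prose restrictions to general
# categories is illustrative only; the proposal does not pin it down.
REJECTED_MAJOR = {"M"}                            # combining marks
REJECTED_EXACT = {"Lm", "Sk", "Cn", "Cs", "Co"}   # modifiers, unassigned,
                                                  # surrogates, private use


def check_endpoint(s):
    """Raise ValueError if s breaks the proposed endpoint restrictions."""
    if len(s) == 0:
        raise ValueError("range endpoint must not be zero length")
    for ch in s:
        cat = unicodedata.category(ch)
        if cat[0] in REJECTED_MAJOR or cat in REJECTED_EXACT:
            raise ValueError(
                "U+%04X (%s) is not allowed in a range endpoint" % (ord(ch), cat)
            )


def significant_pairs(lhs, rhs):
    """Pair up correspondingly indexed characters of the two endpoints.

    zip() stops at the shorter string, so anything past the significant
    length in the longer endpoint is silently ignored here.
    """
    check_endpoint(lhs)
    check_endpoint(rhs)
    return list(zip(lhs, rhs))
```

Calling significant_pairs("apple", "orange") pairs a/o, p/r, p/a, l/n and e/g and drops the trailing "e" of "orange"; whether dropping it should be silent or produce an unthrown exception is the open question above.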
For all remaining characters, each character is considered with respect to its correspondingly indexed character in the other string, and the following algorithm is applied to determine the range that they represent (the LHS character is referred to as "A" below, and the RHS as "B").

The binary Unicode general category properties of A and B are considered from the set of major category classes:

L, M, N, P, S, Z, C

Thus the Lu property or Pe property would be considered.

The total range consists of all codepoints lying between the lower of the two codepoints and the higher of the two, inclusive, which share the major and minor Unicode general category property of either A or B (if there is no minor subclass, then codepoints without a minor subclass are considered with respect to that endpoint). The ordering is determined by the ordering of A and B.

The range is then restricted to codepoints which share the same script as A or B. Thus, Latin "a" and Greek lowercase pi would define a range which included all lower-case letters from the Latin and Greek scripts that fell between their codepoints.

Having established this range for each correspondingly indexed letter, the range for multi-character strings is defined by a left-significant counting sequence. For example:

"Ab" .. "Be"

defines the ranges:

<A B> and <b c d e>

This results in a counting sequence (with the most significant character on the left) as follows:

<Ab Ac Ad Ae Bb Bc Bd Be>

Currently, Rakudo produces this:

"Ab", "Ac", "Ad", "Ae", "Af", "Ag", "Ah", "Ai", "Aj", "Ak", "Al", "Am", "An", "Ao", "Ap", "Aq", "Ar", "As", "At", "Au", "Av", "Aw", "Ax", "Ay", "Az", "Ba", "Bb", "Bc", "Bd", "Be"

which I don't think is terribly useful.

Many useful results from this suggested change:

"C" .. "A" = <C B A> (Rakudo: <>)

"(" .. "}" = <( ) [ ] { }> (because open-paren is Ps and close-brace is Pe, therefore all Ps and Pe codepoints in the range are included).

"Α" .. "Ω" = <Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω> (notice that codepoint U+03A2 is gracefully skipped, as it is undefined and thus has no properties).

"apple" .. "orange" = the counting sequence defined by the ranges "a" .. "o", "p" .. "r", "p" .. "a", "l" .. "n", "e" .. "g" (notice that the string "orang" will be part of the result set, but "orange" will not.)

In addition:

One alternative to truncation of strings of differing lengths is to extend the sequence. For example, if we ask for "a" .. "bc", then we might produce <a b ac bc>, where the extension is the original range plus the same range with the extra characters of the longer string concatenated onto each element. This might even be iterated for every additional codepoint in the longer string. For example:

"a" .. "bcd" = <a b ac bc acd bcd>

"..." could have similar semantics. In the case of A, B ... C, for length 1 strings, the range A .. B is simply projected forward until x ge C (if A..B is increasing, le otherwise). C's properties probably should not be considered at all.

In the case of length > 1 strings, each character index is projected forward independently until any one character index is ge the corresponding index in the terminator, and there is no "counting":

"AAA", "BCD" ... "GGG" = <AAA BCD CEG>

If any index in the sequence does not increment (e.g. "AA", "AB" ... "ZZ") then there is an implication that counting is required. You should be able, in this case, to imply incrementing the left or right side as most significant (e.g. "AA", "BA" ... "ZZ" is also valid). It is, however, an error to try to increment indexes in any other ordering (e.g. "AAA", "ABA" ... "ZZZ").

Once a counting sequence has been established, lookahead must be employed to determine the extent of the range (e.g. "A", "B" can continue through all "Latin" Lu codepoints, so in order to know when to cycle, you must determine how many codepoints lie in the full range). This implies that length > 1 strings in "..."
operations which imply a counting sequence are not strictly evaluated lazily, though some laziness may still be employed.

--
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs