On Wed, Jul 28, 2010 at 9:24 PM, Darren Duncan <dar...@darrenduncan.net> wrote:
> Jon Lang wrote:
>
>> I don't know enough about Unicode to suggest how to solve this.

Thankfully, I know little enough to take up the challenge ;-)

>>> All I can say is that my example above should never return a valid
>>> Range object unless there is a way I can specify my own ordering and
>>> I use it.

Please see my suggested approach way, way back at the start of all this.

Use Unicode scripts, properties and codepoint sequences to produce a list of codepoints. Want something more meaningful than codepoints? Great, use an object that knows what you're asking for:

    EnglishDictWord("apple") .. EnglishDictWord("orange")

It's a very Perl way to approach a problem: provide the solution that meets the lowest-common-denominator need (return a range object that represents ranges based on the information we have) and then allow that same feature to be used in cases where the user has provided sufficient context to do something smarter.

I don't think it makes sense to extend the length of strings under consideration by default. Obviously the above example would include "blackberry" because you've asked it to consider English dictionary words, but "aa" .. "zz" shouldn't contain "blackberry" because you don't have enough data to understand what's being asked for, and thus should fall back to treating strings as lists of codepoints. (Speaking of which, do we define a behavior for (1,2,3) .. (4,5,6)? Right now, we consider (1,2,7) to be in that range, and I don't think that's a terribly useful result.)

>> That actually says something: it says that we may want to reconsider
>> the notion that all string values can be sorted. You're suggesting
>> the possibility that "a" cmp "ส้" is, by default, undefined.

By default, I think it should be +1 because of the codepoint comparison. If you then tell Perl that you want that comparison done in a Thai context, then it's probably -1. The golden rule of Unicode is: never pretend you have more information than you do.
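
To make the earlier point about (1,2,3) .. (4,5,6) and "aa" .. "zz" concrete, here is a minimal sketch in Python rather than Perl 6. It assumes range membership is decided by plain lexicographic comparison of the element (or codepoint) sequences. Both helper names are hypothetical, and the same-length rule is only one possible reading of "don't extend the length of strings by default", not a proposal for the spec.

```python
def in_range(lo, hi, x):
    """True if lo <= x <= hi under plain lexicographic ordering."""
    return lo <= x <= hi


def in_range_same_length(lo, hi, x):
    """Hypothetical stricter rule: x must also be as long as both endpoints."""
    return len(lo) == len(hi) == len(x) and lo <= x <= hi


# Lexicographic ordering alone admits both of the surprising members.
print(in_range((1, 2, 3), (4, 5, 6), (1, 2, 7)))
print(in_range("aa", "zz", "blackberry"))

# Fixing the length excludes "blackberry" but still admits "mm".
print(in_range_same_length("aa", "zz", "blackberry"))
print(in_range_same_length("aa", "zz", "mm"))
```

Note that the length rule does nothing about (1,2,7): excluding it would need a per-position rule, which is a separate design question.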
> I think that a general solution here is to accept that there may be more
> than one valid way to sort some types, strings especially, and so
> operators/routines that do sorting should be customizable in some way so
> users can pick the behaviour they want.

And I think that this brings you back to what I was saying at the top of the thread, which is that the most basic approach treats each codepoint as a collection of information and sorts on that information first and then on the codepoint number itself. If that's not useful to you, tell Perl what you really wanted.

> Some possible examples of customization:
>
>   $foo ~~ $a..$b :QuuxNationality  # just affects this one test
>
>   $bar = 'hello' :QuuxNationality  # applies anywhere the Str value is used

That's a bit too easy to read without thinking about the implications. I bring back my original example from long ago: "TOPIXコンポジット1500構成銘柄", which I shamelessly grabbed from a Tokyo Stock Exchange page. That one string, used in everyday text, contains Latin letters, Hiragana [I lied, there's no Hiragana], Katakana, Han (Kanji) ideographs and Latin digits. Now call .succ on that sucker, I dare you, keeping in mind that there's no one "Japanese" script in Unicode.

I think the only valid starting point without any contextual information is to essentially treat it as a sequence of codepoints (as if it were an array of integers) and do something marginally sane on that basis. Then you let the user provide you with hints. Yes, it's "Japanese language", but that doesn't tell you as much as you'd hope, since many of the rules come from the languages that Japanese is borrowing from, here.

One answer is to break it down on script and major category property boundaries into "TOPIX" (Latin: the name of an index), "コンポジット" (Katakana: phonetically this is "konpojitto", or "composite"), "1500" (Latin digits), and "構成銘柄" (Kanji ideographs: constituents).
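
The breakdown above can be roughed out in a few lines. This is a sketch in Python, not Perl 6, and it leans on one loud assumption: the standard library does not expose the Unicode Script property, so the first word of each character's Unicode name ("LATIN", "KATAKANA", "CJK", "DIGIT") stands in for the script. The function names are hypothetical, and a real implementation would use the actual Script property instead.

```python
import itertools
import unicodedata


def script_and_category(ch):
    """Crude stand-in for (script, major general category).

    Assumption: the first word of the Unicode character name approximates
    the script. This is an illustration, not the real Script property.
    """
    name = unicodedata.name(ch, "UNKNOWN")
    return name.split()[0], unicodedata.category(ch)[0]


def split_runs(text):
    """Split text into maximal runs that share the same (script, category) key."""
    groups = itertools.groupby(text, key=script_and_category)
    return ["".join(run) for _, run in groups]


print(split_runs("TOPIXコンポジット1500構成銘柄"))
# ['TOPIX', 'コンポジット', '1500', '構成銘柄']
```

Each resulting run is then a homogeneous sequence of codepoints that can be handled on its own terms.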
Now, treat each one of those as a separate sequence of codepoints and begin incrementing each sub-sequence in turn.

You could also apply Japanese sorting rules to the successor method, but then you get into questions of what the Japanese sorting method is for Latin letters... probably a solved problem, but obscure enough that I'll bet there are edge cases that are NOT solvable just by knowing the locale, because they are finer grained (e.g. which Latin-using language does the word come from? What source language is most appropriate for the context? etc.)

Maybe you throw an exception when you try to tell Perl that "TOPIXコンポジット1500構成銘柄" is a Japanese string... but then Perl is rejecting strings that are considered valid in some contexts within that language.

My only strongly held belief, here, is that you should not try to answer any of these questions for the default range operator on unadorned, context-less strings. For that case, you must do something that makes sense for all Unicode codepoints in nearly all contexts.

--
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs