On Wed, Jul 28, 2010 at 9:24 PM, Darren Duncan <dar...@darrenduncan.net> wrote:

> Jon Lang wrote:
>
>> I don't know enough about Unicode to suggest how to solve this.
>>
>>
Thankfully, I know little enough to take up the challenge ;-)


>  All I can
>>> say is that my example above should never return a valid Range object
>>> unless
>>> there is a way I can specify my own ordering and I use it.
>>>
>>
Please see my suggested approach way, way back at the start of all this. Use
Unicode scripts, properties and codepoint sequences to produce a list of
codepoints. Want something more meaningful than codepoints? Great, use an
object that knows what you're asking for:

   EnglishDictWord("apple") .. EnglishDictWord("orange")

It's a very Perl way to approach a problem: provide the solution that meets
the least common denominator need (return a range object that represents
ranges based on the information we have) and then allow that same feature to
be used in cases where the user has provided sufficient context to do
something smarter.
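
To make that concrete, here is a rough sketch of the "object that knows what
you're asking for" idea. It is written in Python rather than Perl 6 purely for
illustration; the WORDS list, the EnglishDictWord class and word_range are all
hypothetical names, and a real dictionary would obviously be much larger.

```python
from functools import total_ordering

# Hypothetical stand-in for a real dictionary word list.
WORDS = sorted(["apple", "banana", "blackberry", "cherry", "orange", "pear"])


@total_ordering
class EnglishDictWord:
    """A string that carries the context 'I am an English dictionary word'."""

    def __init__(self, text):
        if text not in WORDS:
            raise ValueError(f"{text!r} is not in the word list")
        self.text = text

    def __eq__(self, other):
        return isinstance(other, EnglishDictWord) and self.text == other.text

    def __lt__(self, other):
        if not isinstance(other, EnglishDictWord):
            return NotImplemented
        # Ordering comes from the dictionary, not from codepoints.
        return WORDS.index(self.text) < WORDS.index(other.text)

    def __hash__(self):
        return hash(self.text)


def word_range(first, last):
    """Every dictionary word from first to last, inclusive."""
    lo, hi = WORDS.index(first.text), WORDS.index(last.text)
    return WORDS[lo:hi + 1]
```

With this, word_range(EnglishDictWord("apple"), EnglishDictWord("orange"))
includes "blackberry" because the endpoints told us which universe of values
to walk through.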

I don't think it makes sense to extend the length of strings under
consideration by default. Obviously the above example would include
"blackberry" because you've asked it to consider English dictionary words,
but "aa" .. "zz" shouldn't contain "blackberry" because you don't have
enough data to understand what's being asked for, and thus should fall back
to treating strings as lists of codepoints (speaking of which do we define a
behavior for (1,2,3) .. (4,5,6)? Right now, we consider (1,2,7) to be in
that range, and I don't think that's a terribly useful result).
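
A small Python sketch of the difference (illustration only, not Perl 6; both
function names are made up, and the element-wise test is just one possible
alternative, not a proposal for what the spec should say):

```python
def codepoints(s):
    """View a string as a tuple of integer codepoints."""
    return tuple(ord(ch) for ch in s)


def in_lexicographic_range(item, lo, hi):
    """Plain sequence comparison: compare element by element, left to right."""
    return lo <= item <= hi


def in_elementwise_range(item, lo, hi):
    """Assumed alternative: same length, and each position within its bounds."""
    return (len(item) == len(lo) == len(hi)
            and all(a <= x <= b for a, x, b in zip(lo, item, hi)))
```

Under the lexicographic test, (1,2,7) falls inside (1,2,3) .. (4,5,6) and
codepoints("blackberry") falls inside codepoints("aa") .. codepoints("zz").
Under the element-wise test, neither does.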



>
>> That actually says something: it says that we may want to reconsider
>> the notion that all string values can be sorted.  You're suggesting
>> the possibility that "a" cmp "ส้" is, by default, undefined.
>>
>

By default, I think it should be +1 because of the codepoint comparison. If
you then tell Perl that you want that comparison done in a Thai context,
then it's probably -1.

The golden rule of Unicode is: never pretend you have more information than
you do.



>
> I think that a general solution here is to accept that there may be more
> than one valid way to sort some types, strings especially, and so
> operators/routines that do sorting should be customizable in some way so
> users can pick the behaviour they want.
>

And I think that this brings you back to what I was saying at the top of the
thread which is that the most basic approach treats each codepoint as a
collection of information and sorts on that information first and then the
codepoint number itself. If that's not useful to you, tell Perl what you
really wanted.
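
As a rough sketch of what I mean by sorting on the codepoint's information
first, here is a Python illustration (not Perl 6). Using the general category
as "the information" is my assumption for the example; Python's standard
library does not expose the Script property, so it is left out here.

```python
import unicodedata


def codepoint_key(ch):
    """Sort on a Unicode property first, then on the codepoint number."""
    return (unicodedata.category(ch), ord(ch))


def string_key(s):
    """A string sorts as the list of its per-codepoint keys."""
    return [codepoint_key(ch) for ch in s]
```

For example, sorted(["b", "A", "1", "a"], key=string_key) groups the lowercase
letters together ahead of the uppercase letter and the digit, where a plain
codepoint sort would give "1", "A", "a", "b".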



> Some possible examples of customization:
>
>  $foo ~~ $a..$b :QuuxNationality  # just affects this one test
>
>  $bar = 'hello' :QuuxNationality  # applies anywhere the Str value is used
>

That's a bit too easy to read without thinking about the implications. I
bring back my original example from long ago:

"TOPIXコンポジット1500構成銘柄" which I shamelessly grabbed from a Tokyo Stock
Exchange page. That one string, used in everyday text, contains Latin
letters, Hiragana [I lied, there's no Hiragana], Katakana, Han or Kanji
ideograms and Latin digits.


Now call .succ on that sucker, I dare you, keeping in mind that there's no
one "Japanese" script in Unicode. I think the only valid starting point
without any contextual information is to essentially treat it as a sequence
of codepoints (as if it were an array of integers) and do something
marginally sane on that basis. Then you let the user provide you with hints.
Yes, it's "Japanese language" but that doesn't tell you as much as you'd
hope, since many of the rules come from the languages that Japanese is
borrowing from, here.
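
The "array of integers" fallback could look something like this in Python
(illustration only, not Perl 6; codepoint_succ is a made-up name, and skipping
surrogates and unassigned codepoints is my guess at "marginally sane"):

```python
import unicodedata

MAX_CODEPOINT = 0x10FFFF


def codepoint_succ(s):
    """Successor of a string, treating it as a bare sequence of codepoints."""
    if not s:
        raise ValueError("empty string has no successor")
    cp = ord(s[-1]) + 1
    # Skip surrogates (Cs) and unassigned codepoints (Cn).
    while cp <= MAX_CODEPOINT and unicodedata.category(chr(cp)) in ("Cs", "Cn"):
        cp += 1
    if cp > MAX_CODEPOINT:
        raise OverflowError("no codepoint after the last one")
    return s[:-1] + chr(cp)
```

Note that this knows nothing about letters: the successor of "az" is "a{",
because "{" is simply the next codepoint after "z".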

One answer is to break it down on script and major category property
boundaries into "TOPIX" (Latin: the name of an index), "コンポジット" (Katakana:
phonetically this is "konpojitto" or "composite"), "1500" (Latin digits), and
"構成銘柄" (Kanji ideographs: constituents). Now, treat each one of those as a
separate sequence of codepoints and begin incrementing each sub-sequence in
turn. You could also apply Japanese sorting rules to the successor method,
but then you get into questions of what the Japanese sorting method is for
Latin letters... probably a solved problem, but obscure enough that I'll bet
there are edge cases that are NOT solvable just by knowing the locale,
because they are finer grained (e.g. which Latin-using language does the
word come from? What source language is most appropriate for the context?
etc.)
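
Here is a Python sketch of the splitting step described above (illustration
only, not Perl 6). Python's standard library has no Script property, so
rough_script uses the first word of the character name as a crude stand-in;
that substitution, and all three function names, are my own assumptions.

```python
import unicodedata
from itertools import groupby


def rough_script(ch):
    """Crude stand-in for the Script property: first word of the char name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def boundary_key(ch):
    """Pair of (rough script, major general category such as L or N)."""
    return (rough_script(ch), unicodedata.category(ch)[0])


def split_runs(s):
    """Break a string wherever the script or major category changes."""
    return ["".join(group) for _, group in groupby(s, key=boundary_key)]
```

split_runs("TOPIXコンポジット1500構成銘柄") yields the four pieces "TOPIX",
"コンポジット", "1500" and "構成銘柄", each of which could then be incremented as
its own sequence of codepoints.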

Maybe you throw an exception when you try to tell Perl that
"TOPIXコンポジット1500構成銘柄" is a Japanese string... but then Perl is rejecting
strings that are considered valid in some contexts within that language.

My only strongly held belief, here, is that you should not try to answer any
of these questions for the default range operator on
unadorned, context-less strings. For that case, you must do something that
makes sense for all Unicode codepoints in nearly all contexts.

-- 
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs
