Tom Christiansen wrote:
 There is ABSOLUTELY NO WAY I've found to tell whether these utf-8
 strings should test equal, and when, nor how to order them, without
 knowing the locale:

     "RESUME"
     "Resume"
     "resume"
     "Resum\x{e9}"
     "r\x{E9}sum\x{E9}"
     "r\x{E9}sume\x{301}"
     "Re\x{301}sume\x{301}"

I believe that the most important issues here, those having to do with
identity, can be discussed and solved without unduly worrying about
matters of collation;

It's funny you should say that, as I could nearly swear I just showed
that identity cannot be determined in the examples above without knowing
about locales.  To wit, while all of those sort somewhat differently, even
case-insensitively, no matter whether you're thinking of a French or a
Spanish ordering (and what is English's, anyway?), you have a more
fundamental = vs != scenario which is entirely locale-dependent.

If your current abstraction level is the Unicode codepoint level, then no knowledge of locale is needed at all in an everything-sensitive filesystem. Those 7 examples are all distinct for you, end of story. So you can see why I advocate everything-sensitive as being the "normal" case, same as with Perl identifiers.
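
To make that concrete, here is a rough Perl 5 sketch (just an illustration, nothing more) of codepoint-level identity: plain eq compares the codepoint sequences, and all seven spellings come out distinct without any locale being consulted:

    use strict;
    use warnings;

    # The seven spellings from above, as codepoint strings.
    my @forms = (
        "RESUME", "Resume", "resume", "Resum\x{E9}",
        "r\x{E9}sum\x{E9}", "r\x{E9}sume\x{301}", "Re\x{301}sume\x{301}",
    );

    # eq compares codepoint sequences directly; no locale enters into it.
    my $all_distinct = 1;
    for my $i (0 .. $#forms) {
        for my $j ($i + 1 .. $#forms) {
            $all_distinct = 0 if $forms[$i] eq $forms[$j];
        }
    }
    print $all_distinct ? "all 7 distinct\n" : "some collide\n";   # prints "all 7 distinct"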

Rather than thinking of locales as something special, AFAIK any locale can be reduced to a simple (though possibly verbose, and predefinable in a library) normalized, portable definition built from everything-sensitive components, where the components are enumerations and functions describing:

  - a character repertoire (what characters can exist),
  - representation normalization rules,
  - collation (ordering) rules, where applicable, and
  - mutual exclusion rules, where applicable.
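
As a hypothetical sketch only (the key names below are made up for illustration, not a proposed API), such a locale definition might look like a plain bundle of everything-sensitive parts in Perl 5:

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC);
    use Unicode::Collate;

    my $collator = Unicode::Collate->new();   # module's default table

    # Hypothetical shape of a locale built from everything-sensitive components.
    my %locale = (
        repertoire => sub { $_[0] =~ /\A[\p{Latin}\p{Common}]*\z/ },  # what characters can exist
        normalize  => sub { NFC($_[0]) },                             # representation rules
        collate    => sub { $collator->cmp($_[0], $_[1]) },           # ordering rules
        excludes   => [],                                             # mutual exclusion rules
    );

    print "allowed\n" if $locale{repertoire}->("r\x{E9}sum\x{E9}");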

When your core toolkit just works with everything-sensitive components, and case-insensitive matching or other locale-dependent behaviour is just defined as formulae over that, then we have indeed separated the locale issues into a connected but non-core problem.
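
For example (again just a sketch; fc() needs Perl 5.16 or later), a case- and form-insensitive equality can be defined purely as a formula over the everything-sensitive eq:

    use strict;
    use warnings;
    use feature 'fc';
    use Unicode::Normalize qw(NFC);

    # Insensitive equality defined as a formula over everything-sensitive eq.
    sub insensitive_eq {
        my ($x, $y) = @_;
        return NFC(fc $x) eq NFC(fc $y);
    }

    print insensitive_eq("Resume", "resume")           ? "same\n" : "different\n";  # same
    print insensitive_eq("resume", "r\x{E9}sum\x{E9}") ? "same\n" : "different\n";  # different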

So collation doesn't need to be considered in Perl's file-system
interface, while identity does; collation can be a layer on top of the
core interface that just cares about identity.

That seems a simplified version of reality.  Identity isn't what monoglots
think it is.

I'm wondering if we're talking about the same meaning of the word "collation". The way I have been using it, or meaning to, "collation" simply describes how you put a set of values in order, such that any two distinct values have a before/after relationship. Identity, by contrast, is testing whether two things you hold are the same value or not. You don't need ordering rules defined in order to have known equality rules.
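
To illustrate the distinction (a sketch only, using the stock Unicode::Collate module with its default table): ordering needs collation rules, while identity needs nothing beyond the values themselves:

    use strict;
    use warnings;
    use Unicode::Collate;

    binmode STDOUT, ':encoding(UTF-8)';

    my @words = ("resume", "r\x{E9}sum\x{E9}", "Resume", "RESUME");

    # Ordering requires collation rules (here the module's default table)...
    my $collator = Unicode::Collate->new();
    print join(", ", $collator->sort(@words)), "\n";

    # ...but identity does not: eq just asks whether two held values are the same.
    print $words[0] eq $words[1] ? "identical\n" : "distinct\n";   # distinct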

If you *know* that the 7 strings are all UTF-8, then locale doesn't have
to be considered for equality; just your Unicode abstraction level
matters, such as whether you're defining the values in terms of graphemes vs
codepoints vs bytes.

That's not true.  é is not the same letter as e in Icelandic.

I don't consider those to be the same character, period. Mind you, everywhere I've said "graphemes" I meant language-independent graphemes.

I grant you that if you get into the further abstraction level of language-dependent graphemes, then some may see those 2 characters as identical; if that's your point, then I can better understand where you're coming from with the problems you raise.

Practically speaking, I think that portability and other concerns would require us to just not go higher than the language-independent grapheme abstraction level when dealing with Perl identifiers, file names, or other URLs with non-platform-specific APIs, and simply treat every language-independent grapheme as distinct from every other one, even if some locales might do differently. Users should be able to deal with this gracefully enough, much as people can easily treat "E" and "e" as being distinct.
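
As a final sketch of what I mean by language-independent graphemes (illustration only, using core Unicode::Normalize): canonical normalization makes composed and decomposed spellings of the same grapheme identical, while "e" and "\x{E9}" stay distinct regardless of what any particular locale thinks of them:

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC);

    my $composed   = "r\x{E9}sum\x{E9}";       # U+00E9 twice
    my $decomposed = "re\x{301}sume\x{301}";   # e followed by COMBINING ACUTE ACCENT

    # Same language-independent graphemes, so identical after normalization.
    print NFC($composed) eq NFC($decomposed) ? "same\n" : "different\n";    # same

    # But e and e-acute are different graphemes, locale or no locale.
    print NFC("e") eq NFC("\x{E9}") ? "same\n" : "still distinct\n";        # still distinct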

-- Darren Duncan
