Tom Christiansen wrote:
 There is ABSOLUTELY NO WAY I've found to tell whether these utf-8
 strings should test equal, and when, nor how to order them, without
 knowing the locale:

     "RESUME"
     "Resume"
     "resume"
     "Resum\x{e9}"
     "r\x{E9}sum\x{E9}"
     "r\x{E9}sume\x{301}"
     "Re\x{301}sume\x{301}"

I believe that the most important issues here, those having to do with
identity, can be discussed and solved without unduly worrying about
matters of collation;

It's funny you should say that, as I could nearly swear I just showed
that identity cannot be determined in the examples above without knowing
about locales.  To wit, while all of those sort somewhat differently, even
case-insensitively, no matter whether you're thinking of a French or a
Spanish ordering (and what is English's, anyway?), you have a more
fundamental = vs != scenario which is entirely locale-dependent.

If your current abstraction level is the Unicode codepoint level, then no knowledge of locale is needed at all in an everything-sensitive filesystem. Those 7 examples are all distinct for you, end of story. So you can see why I advocate everything-sensitive as being the "normal" case, same as with Perl identifiers.
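
To make that concrete, here is a rough Perl 5 sketch (just an illustration, nothing more) of codepoint-level identity: plain eq compares the codepoint sequences, and all seven spellings come out distinct without any locale being consulted:

    use strict;
    use warnings;

    # The seven spellings from above, as codepoint strings.
    my @forms = (
        "RESUME", "Resume", "resume", "Resum\x{E9}",
        "r\x{E9}sum\x{E9}", "r\x{E9}sume\x{301}", "Re\x{301}sume\x{301}",
    );

    # eq compares codepoint sequences directly; no locale enters into it.
    my $all_distinct = 1;
    for my $i (0 .. $#forms) {
        for my $j ($i + 1 .. $#forms) {
            $all_distinct = 0 if $forms[$i] eq $forms[$j];
        }
    }
    print $all_distinct ? "all 7 distinct\n" : "some collide\n";   # prints "all 7 distinct"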

Rather than thinking of locales as something special, AFAIK any locale can be reduced to a simple (though possibly verbose, and predefinable in a library) normalized, portable definition built from everything-sensitive components, where the components are enumerations and functions describing:

  - a character repertoire (what characters can exist),
  - representation normalization rules,
  - collation (ordering) rules, where applicable, and
  - mutual exclusion rules, where applicable.
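
As a hypothetical sketch only (the key names below are made up for illustration, not a proposed API), such a locale definition might look like a plain bundle of everything-sensitive parts in Perl 5:

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC);
    use Unicode::Collate;

    my $collator = Unicode::Collate->new();   # module's default table

    # Hypothetical shape of a locale built from everything-sensitive components.
    my %locale = (
        repertoire => sub { $_[0] =~ /\A[\p{Latin}\p{Common}]*\z/ },  # what characters can exist
        normalize  => sub { NFC($_[0]) },                             # representation rules
        collate    => sub { $collator->cmp($_[0], $_[1]) },           # ordering rules
        excludes   => [],                                             # mutual exclusion rules
    );

    print "allowed\n" if $locale{repertoire}->("r\x{E9}sum\x{E9}");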

When your core toolkit just works with everything-sensitive components, and case-insensitive matching or other locale-dependent behaviour is just defined as formulae over that, then we have indeed separated the locale issues into a connected but non-core problem.
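
For example (again just a sketch; fc() needs Perl 5.16 or later), a case- and form-insensitive equality can be defined purely as a formula over the everything-sensitive eq:

    use strict;
    use warnings;
    use feature 'fc';
    use Unicode::Normalize qw(NFC);

    # Insensitive equality defined as a formula over everything-sensitive eq.
    sub insensitive_eq {
        my ($x, $y) = @_;
        return NFC(fc $x) eq NFC(fc $y);
    }

    print insensitive_eq("Resume", "resume")           ? "same\n" : "different\n";  # same
    print insensitive_eq("resume", "r\x{E9}sum\x{E9}") ? "same\n" : "different\n";  # different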

So collation doesn't need to be considered in Perl's file-system
interface, while identity does; collation can be a layer on top of the
core interface that just cares about identity.

That seems a simplified version of reality.  Identity isn't what monoglots
think it is.

I'm wondering if we're talking about the same meaning of the word "collation". The way I have been using it, or meaning to, "collation" simply describes how you put a set of values in order, such that any two distinct values have a before/after relationship. Identity, by contrast, is testing whether two things you hold are the same value or not. You don't need ordering rules defined in order to have known equality rules.
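
To illustrate the distinction (a sketch only, using the stock Unicode::Collate module with its default table): ordering needs collation rules, while identity needs nothing beyond the values themselves:

    use strict;
    use warnings;
    use Unicode::Collate;

    binmode STDOUT, ':encoding(UTF-8)';

    my @words = ("resume", "r\x{E9}sum\x{E9}", "Resume", "RESUME");

    # Ordering requires collation rules (here the module's default table)...
    my $collator = Unicode::Collate->new();
    print join(", ", $collator->sort(@words)), "\n";

    # ...but identity does not: eq just asks whether two held values are the same.
    print $words[0] eq $words[1] ? "identical\n" : "distinct\n";   # distinct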

If you *know* that the 7 strings are all UTF-8, then locale doesn't have
to be considered for equality; just your Unicode abstraction level
matters, such as whether you're defining the values in terms of graphemes vs
codepoints vs bytes.

That's not true.  é is not the same letter as e in Icelandic.

I don't consider those to be the same character, period. Mind you, everywhere I've said "graphemes" I meant language-independent graphemes.

I grant you that if you get into the further abstraction level of language-dependent graphemes, then some may see those 2 characters as identical; if that's your point, then I can better understand where you're coming from with the problems you raise.

Practically speaking, I think that portability and other concerns would require us to just not go higher than the language-independent grapheme abstraction level when dealing with Perl identifiers, file names, or other URLs with non-platform-specific APIs, and simply treat every language-independent grapheme as distinct from every other one, even if some locales might do differently. Users should be able to deal with this gracefully enough, much as people can easily treat "E" and "e" as being distinct.
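
As a final sketch of what I mean by language-independent graphemes (illustration only, using core Unicode::Normalize): canonical normalization makes composed and decomposed spellings of the same grapheme identical, while "e" and "\x{E9}" stay distinct regardless of what any particular locale thinks of them:

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC);

    my $composed   = "r\x{E9}sum\x{E9}";       # U+00E9 twice
    my $decomposed = "re\x{301}sume\x{301}";   # e followed by COMBINING ACUTE ACCENT

    # Same language-independent graphemes, so identical after normalization.
    print NFC($composed) eq NFC($decomposed) ? "same\n" : "different\n";    # same

    # But e and e-acute are different graphemes, locale or no locale.
    print NFC("e") eq NFC("\x{E9}") ? "same\n" : "still distinct\n";        # still distinct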

-- Darren Duncan
