Re: Files, Directories, Resources, Operating Systems

Tom Christiansen Thu, 27 Nov 2008 02:26:49 -0800

In-Reply-To: Message from Darren Duncan <[EMAIL PROTECTED]> 
   of "Wed, 26 Nov 2008 19:34:09 PST." <[EMAIL PROTECTED]>

> Tom Christiansen wrote:

>>  I believe database folks have been doing the same with character data, but
>>  I'm not up-to-date on the DB world, so maybe we have some metainfo about
>>  the locale to draw on there.  Tim?

> AFAIK, modern databases are all strongly typed at least to the point
> that the values you store in and fetch from them are each explicitly
> character data or binary data or numbers or what-have-you; and so,
> when you are dealing with a DBMS in terms of character data, it is
> explicitly specified somewhere (either locally for the data or
> globally/hardcoded for the DBMS) that each value of character data
> belongs to a particular character repertoire and text encoding, and so
> the DBMS knows what encoding etc the character data is in, or at least
> it treats it consistently based on what the user said it was when it
> input the data.

Oh, good then.  That's what I'd heard was happening, but wasn't sure since
I've steered clear of such beasties since before it was true.

I wish our filesystems worked that way.  But Andrew said something to me
last week about Ken and Dennis writing quite pointedly that while you
*could* use the f/s as a database, that you *shouldn't*.  I didn't know
the reference he was thinking of, so just nodded pensively (=thoughtfully).

>>  There is ABSOLUTELY NO WAY I've found to tell whether these utf-8
>>  strings should test equal, and when, nor how to order them, without
>>  knowing the locale:
>>  
>>      "RESUME",
>>      "Resume"
>>      "resume"
>>      "Resum\x{e9}"
>>      "r\x{E9}sum\x{E9}"
>>      "r\x{E9}sume\x{301}"
>>      "Re\x{301}sume\x{301}"

>>  Case insensitively, in Spanish they should be identical in all regards.
>>  In French, they should be identical but for ties, in which case you
>>  work your way right to left on the diacriticals.

> This leads me to talk about my main point about sensitivity etc.

> I believe that the most important issues here, those having to do with
> identity, can be discussed and solved without unduly worrying about
> matters of collation;

It's funny you should say that, as I could nearly swear that I just showed
that identity cannot be determined in the examples above without knowing
about locales.  To wit, while all of those sort somewhat differently, even
case-insensitively, no matter whether you're thinking of a French or a
Spanish ordering (and what is English's, anyway?), you have a more
fundamental = vs != scenario which is entirely locale-dependent.

If I can make a "RESUME" file, ought I be able to make a distinct
"r\x{E9}sum\x{E9}" or "re\x{301}sume\x{301}" file in a case-ignorant
filesystem? There is no good answer, because we might think it
reasonable to test

    lc(strip_marks($old_fn)) eq lc(strip_marks($new_fn))

The problem is that what is or is not a "mark" varies by locale:

    *  Castilian doesn't think ~ is a mark; Portuguese does, and
       so if you strip marks, in Castilian you count as the same
       two letters that it deems distinct, but in Portuguese, you
       incur no lasting harm.

    *  Catalan doesn't think ¸ is a mark; French does, and so if you strip
       marks, in Catalan you count as the same two letters that it deems
       distinct, but in French or Portuguese, you incur no lasting harm.

    *  Modern English (usually) decomposes æ into a+e, but OE/AS and
       Icelandic do not.

    *  Moreover, Icelandic deems é and e to be completely
       different letters altogether.  If you strip marks, you 
       count as the same letters that that language does not.
       Similarly with ö, which is at the end of their alphabet,
       (like ø in some), and nowhere near o or ó.  BTW, those
       are three separate letters, not variants.

    *  And in OE/AS you could have a long mark on an asc (say "ash" for the
       atomic *letter* æ).  If split into a and e and stripped of marks, it
       wouldn't make any sense at all.
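
A minimal sketch of what a locale-blind strip_marks could look like.  The
helper is hypothetical (nothing in this thread defines it); it assumes the
core Unicode::Normalize module and treats every character with the Unicode
Mark property as strippable, which is exactly the assumption the cases
above call into question.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: decompose, drop every combining mark, recompose.
# It knows nothing about locales, so it strips the same marks everywhere.
sub strip_marks {
    my ($str) = @_;
    (my $bare = NFD($str)) =~ s/\p{M}+//g;
    return NFC($bare);
}

# Three spellings from earlier in this message.
my @names = ("RESUME", "r\x{E9}sum\x{E9}", "Re\x{301}sume\x{301}");
my %seen;
$seen{ lc strip_marks($_) }++ for @names;
printf "%d distinct key(s): %s\n", scalar(keys %seen), join ", ", sort keys %seen;

# n-tilde (U+00F1) collapses onto plain n, although Castilian deems
# them distinct letters.
my $n_collides = strip_marks("a\x{F1}o") eq "ano";

# Ash (U+00E6) has no canonical decomposition, so it passes through
# untouched and never becomes "ae".
my $ash_kept = strip_marks("\x{E6}lene") eq "\x{E6}lene";

print "n-tilde collides with n: ", ($n_collides ? "yes" : "no"), "\n";
print "ash left intact: ",         ($ash_kept   ? "yes" : "no"), "\n";
```

All three names end up under the one key "resume", so a filesystem using
this test would refuse to keep them apart, whatever the user's language
thinks of the matter.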

Case in point: Ælene Frisch, whom many of you doubtless know, insists her
name be spelt as I have written it.  She does not want Aelene Frisch, for
she considers her forename to have 5 letters in it, not 6.  But Unicode
doesn't give us a title case version of that (did AS?), suggesting it a
ligature not a digraph.

But if we have a file called "ÆLENE", may we assume it the same in a
case-insensitive sense to both "aelene" and "ælene"?

I can only go on code-points, because I don't want to deal with ß and SS
and Ss.  Case-folding file systems are just begging for trouble, and I just
don't know what to do.  Think of the 3 Greek sigmata.
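
For illustration, here is how the ß and sigma cases behave under Perl's
own case mappings.  This is a sketch that assumes a perl recent enough to
provide fc() and Unicode string semantics through "use v5.16", which is
newer than this thread; it shows what the language does, not what a
filesystem ought to do.

```perl
use v5.16;      # assumption: enables strict, fc() and unicode_strings
use warnings;

my $sharp_s = "\x{DF}";             # LATIN SMALL LETTER SHARP S

my %mapped = (
    upper => uc($sharp_s),          # "SS": one character becomes two
    title => ucfirst($sharp_s),     # "Ss"
    fold  => fc($sharp_s),          # "ss"
);
printf "%-5s => %s\n", $_, $mapped{$_} for sort keys %mapped;

# Full case folding makes the sharp-s and double-s spellings compare equal.
my $ss_equal = fc("stra\x{DF}e") eq fc("STRASSE");

# The three sigmas: capital U+03A3, medial U+03C3, final U+03C2.
my @sigmas = ("\x{3A3}", "\x{3C3}", "\x{3C2}");
my %folded = map { fc($_) => 1 } @sigmas;

# lc() alone keeps the two lowercase sigmas apart; fc() merges them.
my $lc_differs = lc("\x{3C3}") ne lc("\x{3C2}");

printf "sharp s folds to ss: %s\n",           $ss_equal   ? "yes" : "no";
printf "distinct folded sigmas: %d\n",        scalar keys %folded;
printf "lc keeps final sigma distinct: %s\n", $lc_differs ? "yes" : "no";
```

Note that comparing lc() results and comparing fc() results give different
answers for the same pair of names, so even the choice of folding function
changes which files count as the same.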

> identity is a lot more important than collation, as well as a
> precondition for collation, and collation is a lot more difficult and can
> be put off.

I agree with everything save "and can be put off".  I would like
you to be right.  I should truly wish to be mistaken.  And I don't know
what we have for prior (cough) art.

> respect to dealing with a file system, generally it is just identity that
> matters and collation is a concern that can typically be just tacked on
> after identity is solved.

> That is, with a file system you need to know whether or not a file name
> you hold will or won't match a file in the system, and matching or not-
> matching is the main function of an identity.

But you can't match without knowing locales.  It's NOT just collation. I'll
leave Icelandic out of it, but look at the trouble with 0xDF spilling from
one each to two chars and two bytes in the perl5 regex engine.  Then look
at 0xFF spilling from one char to one char and three bytes there.  It's
just plain horripilating.

> Collation criteria is something that can be naturally applied externally
> to a file system, such as by a user program, and only identity criteria
> needs to be built-in to the file system.

I don't think you can do identity (case-wise) correctly without regard to
digraphs and a world of weirdnesses we really wish we didn't.  But you know
what else I wonder: what existing art *IS* there?  It's so hard a problem
that I wonder if anyone has done a good job at it.

Talking to the standards geeks at Usenix, including Andrew, brought no joy.
They basically just threw up their hands, and lunch.  I really wish I
could talk to Rob Pike and Udi Manber, my old theory and regex prof, but I
think they've both drunk the Googlaide now.  I know Google strips accents
willy-nilly and does case-insensitive compares, but I don't know if that's a
global solution.

> So collation doesn't need to be considered in Perl's file-system
> interface, while identity does; collation can be a layer on top of the
> core interface that just cares about identity.

That seems a simplified version of reality.  Identity isn't what monoglots
think it is.

> If you *know* that the 7 strings are all UTF-8, then locale doesn't have
> to be considered for equality; just your unicode abstraction level
> matters, such as if you're defining the values in terms of graphemes vs
> codepoints vs bytes.

That's not true.  é is not the same letter as e in Icelandic.
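
One way to probe this, sketched under the assumption that a perl with
Unicode::Collate::Locale is available (it postdates this thread) and that
its 'is' and 'es' tailorings follow CLDR.  The comparison is done at
primary strength, i.e. on base letters only, so any "different" answer
means the collator considers the two characters separate letters rather
than accented variants.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

# level => 1 compares base letters only, ignoring accents and case.
my @collators = (
    [ root => Unicode::Collate->new(level => 1) ],
    [ is   => Unicode::Collate::Locale->new(locale => 'is', level => 1) ],
    [ es   => Unicode::Collate::Locale->new(locale => 'es', level => 1) ],
);

# e vs e-acute, then n vs n-tilde.
for my $pair ([ "e", "\x{E9}" ], [ "n", "\x{F1}" ]) {
    my ($plain, $marked) = @$pair;
    for my $entry (@collators) {
        my ($name, $collator) = @$entry;
        printf "U+%04X vs U+%04X under %-4s: %s\n",
            ord($plain), ord($marked), $name,
            $collator->eq($plain, $marked) ? "same letter"
                                           : "different letters";
    }
}
```

The untailored root collator should treat both pairs as the same letter.
Wherever a tailored collator disagrees, equality of two names has become a
question about locale, not about UTF-8.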

>>  See what a mess it's going into?  Larry, can you think of something
>>  simple?  I haven't been able to.  Unicode solves so few of the problems 
>>  people think it does.  We've still so much to do, and I don't just
>>  mean perlers.

> AFAIK, Unicode does have an answer for the most important problems.

>>  Darren>> To summarize, what we really want is something more generic
>>  Darren>> than case-sensitivity, which is text normalization and text
>>  Darren>> folding in general, as well as distinctly dealing with
>>  Darren>> distinctness for representation versus distinctness for mutual
>>  Darren>> exclusivity.

>>  I think that you might have to use a Unicode::Collate object, since
>>  it implements the standard DUCET.  It doesn't help much for actual
>>  locales, but it does take care of some of the things you're concerned with.

> Makes sense.

Yes, I think so too.  But it is very expensive in performance.  Play with
my program.  Makes you want to cheat.
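
This is not the program mentioned above, only a minimal sketch of the
idea: it runs the seven strings from earlier in this message through the
core Unicode::Collate module with its default DUCET table and counts how
many distinct names survive at each comparison strength.  The
distinct_at helper is a hypothetical name introduced for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @names = (
    "RESUME", "Resume", "resume", "Resum\x{e9}",
    "r\x{E9}sum\x{E9}", "r\x{E9}sume\x{301}", "Re\x{301}sume\x{301}",
);

# Hypothetical helper: count the distinct sort keys the seven names
# produce at a given strength (1 = letters, 2 = + accents, 3 = + case).
sub distinct_at {
    my ($level) = @_;
    my $collator = Unicode::Collate->new(level => $level);
    my %keys;
    $keys{ $collator->getSortKey($_) }++ for @names;
    return scalar keys %keys;
}

printf "level %d: %d distinct name(s)\n", $_, distinct_at($_) for 1 .. 3;
```

Level 1 merges all seven, level 2 separates them by accents only, and
level 3 also separates by case, while the precomposed and decomposed
spellings of the same word stay merged throughout.  Computing each sort
key once and comparing the keys, as here, is one way to soften the cost
of calling the collator for every comparison.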

>>  Darren>> [This] implies that sensitivity is special whereas sensitivity
>>  Darren>> should be considered normal, and rather insensitivity should
>>  Darren>> be considered special.

>>  I think Darren may be right, because even case-sensitivity is a real
>>  problem.

> It sure is.

No kidding. :-(

--tom
