Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects

Ralph Corderoy Wed, 18 Jun 2014 04:09:08 -0700

Hello Ken,

> > The Unix kernel stores filenames as a run of bytes, not including
> > `/' and NUL.
>
> That's not universally true anymore.  Some newer filesystems are
> mandating that filenames are UTF-8 and enforcing normalization rules
> (MacOS X and Solaris are two notable examples).


Thanks, I didn't know.  Haven't used Solaris in years, and never bought
Apple.

> The only way of resolving this is to use the normalization rules for
> Unicode and do filename searching that way;

Sure.

> MacOS X actually rewrites all of the filenames using Normalization
> Form D (all characters in decomposed form, which means the regular
> character followed by the combining accents) and I think that sucks,
> but they didn't ask me.

I think I agree with you.

> Solaris is better; the original bytes are preserved, but lookup is
> done using normalized names so you can't have two filenames with the
> same characters.
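
To check I follow the Solaris behaviour as described, here is a toy
sketch in Python.  It is not how any real filesystem is written: the
class name and its methods are made up for illustration, and NFC as the
lookup form is my assumption, since the form Solaris uses isn't stated
above.

```python
import unicodedata


class NormalizingDirectory:
    """Toy directory: stores names as given, finds them by normalized key."""

    def __init__(self):
        # Normalized key -> name exactly as the caller supplied it.
        self._entries = {}

    def create(self, name):
        key = unicodedata.normalize("NFC", name)  # NFC is an assumption
        if key in self._entries:
            # Same characters, different code points: treated as one name.
            raise FileExistsError(name)
        self._entries[key] = name

    def lookup(self, name):
        # Returns the originally stored spelling, whatever form was asked for.
        return self._entries[unicodedata.normalize("NFC", name)]
```

Creating the decomposed spelling and then looking up the precomposed one
returns the decomposed original, and a second create with the other
spelling fails.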

What about globbing, especially on Mac OS X?  Given your two examples on
Linux with bash,

    $ touch résumé résumé
    $ ls r?sum?
    résumé
    $ ls r?sum? | recode ..dump
    UCS2   Mne   Description

    0072   r     latin small letter r
    00E9   e'    latin small letter e with acute
    0073   s     latin small letter s
    0075   u     latin small letter u
    006D   m     latin small letter m
    00E9   e'    latin small letter e with acute
    000A   LF    line feed (lf)
    $
    $ ls r??sum??
    résumé
    $
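
The same effect can be reproduced away from the shell.  A rough sketch
in Python, assuming the two names above are the NFC and NFD spellings,
and using fnmatch as a stand-in for the shell's matching (it counts code
points, which is what bash appears to do here):

```python
import fnmatch
import unicodedata

# Precomposed form: each e-acute is the single code point U+00E9.
nfc = unicodedata.normalize("NFC", "r\u00e9sum\u00e9")
# Decomposed form: each e-acute is 'e' followed by combining U+0301.
nfd = unicodedata.normalize("NFD", nfc)

print(len(nfc), len(nfd))  # 6 8

names = [nfc, nfd]


def glob_names(pattern, names):
    """Return the names matching a shell-style pattern, case-sensitively."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


# One ? per accented letter finds only the precomposed name.
one_each = glob_names("r?sum?", names)
# Two ?s per accented letter finds only the decomposed name.
two_each = glob_names("r??sum??", names)
```

So a ? matches one code point, not one visible character, and which
pattern finds the file depends on how the name was typed or stored.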

Do you think NFKC would be better, so `?' often matches what appears as
a single rune and `fi' matches the ligature `ﬁ'?
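
For what I mean by that, a small Python sketch using the standard
unicodedata module; the variable names are only for illustration:

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, one code point

# NFC leaves the ligature alone; NFKC folds it to the two letters f, i.
nfc_lig = unicodedata.normalize("NFC", ligature)
nfkc_lig = unicodedata.normalize("NFKC", ligature)

# NFKC also composes, so a decomposed e-acute becomes one code point.
decomposed = "e\u0301"
nfkc_e = unicodedata.normalize("NFKC", decomposed)

print(nfc_lig == ligature)         # True
print(nfkc_lig == "fi")            # True
print(len(decomposed), len(nfkc_e))  # 2 1
```

Not every accented combination has a precomposed code point, hence
"often" rather than "always".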

Cheers, Ralph.

