Re: Unicode filenames with Apple File System and UIManagedDocument

Aandi Inston Tue, 21 Mar 2017 07:21:08 -0700

Is the question, what is canonical mapping? I'm going to assume it is, so I
can share what I found when I hit much the same issue. This is mostly from
memory so let's hope it's right.

Take the word Café. How many Unicode characters is this and what are they?
Turns out there are two answers. The last character as seen on screen is a
lower case e with an acute accent.
Let's ignore C, a, f as they are the same in both answers. First answer: é is
'LATIN SMALL LETTER E WITH ACUTE' (U+00E9). We'll call this "composed". In
UTF-8 that's two bytes, 0xC3 0xA9. (This is the answer you'd often get, but
it's not the only answer, and not the one Apple filesystems like.)

Second answer uses a combining accent character. These are designed to
appear in the same space as the preceding character. So combine "e" and an
acute accent (like a floating, slanted apostrophe) and we have "é". This
means you could get the same result from the two Unicode characters LATIN
SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301). We'll
call this "decomposed". In UTF-8 that would be 0x65 0xCC 0x81: three bytes,
two code points, which combine into a single character on screen. (This is
the one Apple filesystems like.)
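To make the two answers concrete, here is a small sketch in Python (my choice of language for illustration, not something from the original question). It prints the code point count and UTF-8 bytes of each form of the whole word. The helper name `describe` is made up for this example.

```python
import unicodedata

composed = "Caf\u00e9"      # C, a, f, then U+00E9
decomposed = "Cafe\u0301"   # C, a, f, e, then U+0301

def describe(text):
    """Return (list of code point names, UTF-8 bytes as a hex string)."""
    names = [unicodedata.name(ch) for ch in text]
    utf8 = " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))
    return names, utf8

for label, text in (("composed", composed), ("decomposed", decomposed)):
    names, utf8 = describe(text)
    print(label, "-", len(text), "code points:", utf8)
    print("  last code point:", names[-1])
```

Both strings draw the same way on screen, but the composed one is 4 code points and the decomposed one is 5.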

When you're typing in a word processor, or showing an alert, it hardly
matters how you create the e acute. Both look the same. But searching may
be a problem (not discussed) as may showing items in alphabetical order
(also not discussed).

Let's imagine now we have a filename Café. This could be represented in
UTF-8 bytes as 0x43 0x61 0x66 0xC3 0xA9 (composed), or as 0x43 0x61 0x66
0x65 0xCC 0x81 (decomposed). But ultimately there needs to be a set of bits
on disk, in a directory, saying the name of the file. When searching for a
file we could have three choices: (a) the composed and decomposed forms are
separate file names for two distinct files, whose names will look the same;
(b) these are the same file, which means all file access by name, and all
searching, has to compose or decompose for comparison purposes; (c) only one
form is allowed and the other is rejected or invalid.
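Choice (b) can be sketched in a few lines of Python. This is only an illustration of the idea, not how any filesystem actually implements it. The function name `same_name` is hypothetical, and I've picked decomposition as the common form for the comparison.

```python
import unicodedata

def same_name(a, b):
    """Treat two names as equal if they match after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "Caf\u00e9"
decomposed = "Cafe\u0301"

print(composed == decomposed)           # False: a plain comparison sees two names
print(same_name(composed, decomposed))  # True: equal once both are decomposed
```

Under choice (a) the plain `==` result would be the whole story, and you could end up with two files whose names look identical.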

Where are we? A bit of (b) and a bit of (c). Finder and file dialogs always
decompose what is typed, and this is stored as the string of bits giving
the file name. It seems that some APIs will automatically decompose their
input, and others won't, and we may be in transition [to judge from the bug
response]. So for safety, use a method that decomposes. (Unicode defines at
least two other types of de/composition, not discussed.)
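As a sketch of "decompose before you use the name", here is the idea in Python. The helper names `decomposed_filename` and `path_for` and the directory `/tmp/docs` are made up for the example. I'm also assuming that standard Unicode NFD is close enough to what the Apple APIs produce for the purposes of illustration.

```python
import unicodedata
from pathlib import PurePosixPath

def decomposed_filename(name):
    """Return the name in canonical decomposed form (NFD)."""
    return unicodedata.normalize("NFD", name)

def path_for(directory, name):
    """Build a path whose final component is always decomposed."""
    return PurePosixPath(directory) / decomposed_filename(name)

typed = "Caf\u00e9.txt"               # composed, as a user might type it
path = path_for("/tmp/docs", typed)   # hypothetical directory

print([hex(ord(ch)) for ch in path.name])
```

Decomposing a name that is already decomposed changes nothing, so it is safe to run every name through the helper regardless of where it came from.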

Apple calls decomposed "canonical". This is fine, except that Unicode
refers to both "canonical decomposition" (what Apple filenames need) and
"canonical composition" (the opposite). So if handling names via an Apple
API made for filenames we are fine to talk of canonical file names. But if
handling names with a general Unicode API, we need to understand that this
means "canonical decomposition" rather than "canonical composition".
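In a general Unicode API the two directions usually go by their normalization form names. In Python's `unicodedata` module (used here purely as an example of a general API), canonical decomposition is "NFD" and canonical composition is "NFC". If I remember right, the Foundation name for the decomposing direction is `decomposedStringWithCanonicalMapping`, but check the documentation rather than trusting my memory.

```python
import unicodedata

name = "Caf\u00e9"  # composed input

nfd = unicodedata.normalize("NFD", name)  # canonical decomposition
nfc = unicodedata.normalize("NFC", nfd)   # canonical composition

print(len(name), len(nfd), len(nfc))           # 4 5 4
print(nfc == name)                             # True: composing undoes decomposing here
print(unicodedata.is_normalized("NFD", nfd))   # True (is_normalized needs Python 3.8+)
print(unicodedata.is_normalized("NFD", name))  # False: composed input is not NFD
```

So "canonical" alone is not enough to pick a direction in such an API; for filenames it is the NFD result you want.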

On 21 March 2017 at 11:03, <davel...@mac.com> wrote:

>
> > What Apple suggested is to Unicode-normalize the filename before adding
> > it to the URL. Did you try doing that?
>
> I’m trying to find out what that means.
_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
