Is the question "what is canonical mapping"? I'm going to assume it is, so I can share what I found when I hit much the same issue. This is mostly from memory, so let's hope it's right.
Take the word Café. How many Unicode characters is this, and what are they? It turns out there are two answers. The last character as seen on screen is a lower-case e with an acute accent. Let's ignore C, a, f as they are the same in both answers.

First answer: é is 'LATIN SMALL LETTER E WITH ACUTE' (U+00E9). We'll call this "composed". In UTF-8 that's two bytes, 0xC3 0xA9. (This is the answer you'd often get, but it's not the only answer, and not the one Apple filesystems like.)

Second answer: use a combining accent character. These are designed to appear in the same space as another character. So combine "e" and an acute accent (like a floating, slanted apostrophe) and we have "é". This means you could get the same result from the two Unicode characters LATIN SMALL LETTER E (U+0065) and COMBINING ACUTE ACCENT (U+0301). We'll call this "decomposed". In UTF-8 that would be 0x65 0xCC 0x81: three bytes, two code points, which combine into a single character on screen. (This is the one Apple filesystems like.)

When you're typing in a word processor, or showing an alert, it hardly matters how you create the e acute. Both look the same. But searching may be a problem (not discussed), as may showing items in alphabetical order (also not discussed).

Let's imagine now we have a filename Café. This could be represented in UTF-8 bytes as 0x43 0x61 0x66 0xC3 0xA9 (composed), or as 0x43 0x61 0x66 0x65 0xCC 0x81 (decomposed). But ultimately there needs to be a set of bits on disk, in a directory, saying the name of the file. When searching for a file we could have three choices:

(a) the composed and decomposed forms are separate file names for two distinct files, whose names will look the same;
(b) they are the same file, which means all file access by name, and all searching, has to compose or decompose for comparison purposes;
(c) only one form is allowed and the other is rejected as invalid.

Where are we? A bit of (b) and a bit of (c).
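To make the two answers concrete, here is a small sketch in Python. It is only an illustration: Python's standard `unicodedata` module stands in for whatever the Apple APIs do internally, and the helper names `utf8_hex` and `same_name` are made up for this example. It prints the code point count and UTF-8 bytes of each form, then shows that the two strings compare unequal as-is but equal once both are decomposed (Unicode normalization form D), which is the kind of comparison choice (b) requires.

```python
import unicodedata

# The same visible word, spelled two ways.
composed = "Caf\u00e9"      # é as one code point: U+00E9
decomposed = "Cafe\u0301"   # e (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)


def utf8_hex(s):
    """Return the UTF-8 bytes of s as a string like '0x43 0x61 ...'."""
    return " ".join(f"0x{b:02X}" for b in s.encode("utf-8"))


def same_name(a, b):
    """Compare two names after canonically decomposing both (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(len(composed), utf8_hex(composed))      # 4 0x43 0x61 0x66 0xC3 0xA9
print(len(decomposed), utf8_hex(decomposed))  # 5 0x43 0x61 0x66 0x65 0xCC 0x81
print(composed == decomposed)                 # False: different code points
print(same_name(composed, decomposed))        # True: equal once both are decomposed
```

Normalizing both sides to the composed form (NFC) would make them compare equal just as well; decomposing is used here only because that is the form described above as the one Apple filesystems like.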
Finder and file dialogs always decompose what is typed, and this is stored as the string of bits giving the file name. It seems that some APIs will automatically decompose their input, and others won't, and we may be in transition [to judge from the bug response]. So for safety, use a method that decomposes. (Unicode defines at least two other types of de/composition, not discussed.)

Apple calls decomposed "canonical". This is fine, except that Unicode refers to both "canonical decomposition" (normalization form D, what Apple filenames need) and "canonical composition" (normalization form C, the opposite). So if handling names via an Apple API made for filenames, we are fine to talk of canonical file names. But if handling names with a general Unicode API, we need to understand that this means "canonical decomposition" rather than "canonical composition".

On 21 March 2017 at 11:03, <davel...@mac.com> wrote:
>
> > What Apple suggested is to Unicode-normalize the filename before adding
> > it to the URL. Did you try doing that?
>
> I’m trying to find out what that means.