On Wed, Nov 14, 2012 at 2:57 PM, Florian Kainz <ka...@ilm.com> wrote:
>
> The problem is that a channel or attribute name such as
> "grün" could be represented as the character sequence
>
>     0067 0072 0075 0308 006E
>     (g, r, u, combining diaeresis, n)
>
> or as
>
>     0067 0072 00FC 006E
>     (g, r, u with diaeresis, n).
>
> Typographically the two representations look identical, but
> string comparisons would treat them as different. I can't
> imagine users being happy to be told that a file contains,
> for example, a "grün" channel of type HALF and a "grün"
> channel of type FLOAT, where the only difference between
> the names is how they are represented in Unicode.
>
> As far as I can tell, either string comparison needs to
> perform some normalization on the fly, or the strings that
> are compared must already be normalized.
>
> Yes, normalization is a headache, but with Unicode there is
> not a one-to-one correspondence between the character
> sequence stored in a string and the typographical
> representation of that string.
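Florian's point can be demonstrated in a few lines of Python (an illustrative sketch only, not OpenEXR code): the NFD and NFC spellings of "grün" render identically, yet they encode to different UTF-8 bytes, compare unequal, and even collate in different positions under a byte-wise comparison, until normalization maps one onto the other.

```python
import unicodedata

# The two spellings of "grün" from the example above.
nfd = "gru\u0308n"   # g, r, u, combining diaeresis (U+0308), n
nfc = "gr\u00fcn"    # g, r, u-with-diaeresis (U+00FC), n

# They render identically but are different byte sequences in UTF-8.
assert nfd.encode("utf-8") == b"gru\xcc\x88n"
assert nfc.encode("utf-8") == b"gr\xc3\xbcn"
assert nfd != nfc

# A byte-wise comparison (what strcmp() does, treating bytes as
# unsigned) orders the two spellings of the *same* name differently.
assert nfd.encode("utf-8") < nfc.encode("utf-8")

# Normalizing to a single canonical form makes them compare equal.
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd
```

In other words, without an agreed-upon canonical form, a file could legitimately contain both spellings as distinct channel names.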
I understand the point of normalization, but I do not think it is the
library's responsibility. From an application's point of view: it hands
one Unicode representation to the library, later asks the library for
what it stored, and gets back a different answer. That would be a hard
bug to track down. Similarly, someone who stores filenames in headers
and expects to get back byte-for-byte identical strings will run into
problems when they find that the filenames do not exist (because the
returned strings use a different normalization form).

Is it not easier to treat the data as raw bytes and not care?

I'm in favor of UTF-8 as a recommendation. I'm on the fence about
enforcing it in the library (it couldn't hurt). I am not overly excited
about pushing normalization issues into the library.

What's the driving benefit of forcing a particular normalization? The
user used a particular form; why not use it as-is? Presumably the rest
of their application uses it too, so leaving the data untouched lets
them make the call.

> Florian
>
>
> David Aguilar wrote:
>>
>> On Wed, Nov 14, 2012 at 11:47 AM, Florian Kainz <ka...@ilm.com> wrote:
>>>
>>> The ACES image container specification, meant to be compatible with
>>> OpenEXR, prescribes UTF-8 for the representation of strings.
>>> Therefore I suggest that OpenEXR adopt the following rules:
>>>
>>> - All text strings are to be interpreted as Unicode, encoded as
>>>   UTF-8. This includes attribute names and strings contained in
>>>   attributes, for example channel names.
>>>
>>> - Text strings stored in files must be in Normalization Form C (NFC,
>>>   canonical decomposition followed by canonical composition).
>>
>> I would stay far away from dealing with normalization issues.
>>
>> Poke around on OS X and its broken HFS filesystem to see why:
>>
>> http://radsoft.net/rants/20080405,00.shtml
>>
>> If the library verified UTF-8, that would be enough IMO.
>>
>> Imagine some poor sucker who goes and stores Unicode filenames in a
>> header.
>> It's not fun to have a library silently "fix" things for you.
>>
>> What's the upside of doing the normalization? How about just leaving
>> it as-is? That way the code can stay simple, and whatever you put in
>> is byte-for-byte identical to what you get out.
>>
>> Other than that, UTF-8 all the way as the "recommended" encoding.
>>
>>> - Where text strings need to be collated, strcmp() is used to
>>>   compare the corresponding char sequences: string A comes before
>>>   (or is less than) string B if
>>>
>>>       strcmp(A,B) < 0
>>>
>>>   (Note: this is not ambiguous; the C99 standard specifies that
>>>   strcmp() interprets the bytes that make up a string as unsigned.)
>>>
>>> - Text strings passed to the IlmImf library must be encoded as
>>>   UTF-8 and in Normalization Form C.
>>>
>>> As far as I can tell, these rules are entirely compatible with all
>>> existing versions of the IlmImf library. Users whose writing system
>>> includes non-ASCII Unicode characters can continue to employ the
>>> existing library versions without change.
>>>
>>> Future versions of the library should verify that text strings are
>>> valid UTF-8. In addition, the library should either verify that
>>> strings are normalized to NFC, or normalize to NFC on the fly.
>>
>> If we treat them like raw bytes then we really don't care about the
>> encoding, do we? (That's why I said "recommended".)
>>
>> It would be nice if the library stayed encoding-agnostic.
>>
>> Is there a reason why it needs to enforce the encoding, or is a
>> strong recommendation to use UTF-8 good enough?

-- 
David
_______________________________________________
Openexr-devel mailing list
Openexr-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/openexr-devel