After mulling this over a bit more, I think the rule that _all_ strings must be normalized is too restrictive. Strings such as the value of the "comments" attribute are only stored in the file or retrieved from the file. The library performs no other processing, so there's no requirement for normalization.
However, attribute names and channel names are used by the library for table lookups. In those cases normalization is necessary, or the lookup will not work correctly. Comparison with strcmp() can fail when a string has more than one possible representation.

I don't see how the onus could be shifted to application code. If a user types a channel name such as ??.R (see my earlier mail), must the application try all possible representations of this string in order to find out if the corresponding channel exists? If the application fails to do this, should the library allow a channel list that contains multiple channels with the name ??.R, the only difference being that in one case ?? is represented as two Hangul syllable characters, in the next ? has been split into Jamo but ? is a syllable, and so on?

If a single application generates all the attribute and channel names in a file, then we can reasonably assume that the application uses consistent rules for encoding strings throughout its code base. However, applications must be able to handle channel names found in files that may have been generated by other applications, possibly with different internal conventions for text processing. For example, an application that internally represents Korean text using syllables should be able to handle OpenEXR files that were written by an application that uses Jamo, or any combination of Jamo and syllables.

I propose a revised set of rules:

- All text strings are to be interpreted as Unicode, encoded as UTF-8. This includes attribute names and strings contained in attributes, such as channel names.

- Attribute names and channel names stored in files must be in Normalization Form C (NFC, canonical decomposition followed by canonical composition).
- Where attribute names or channel names need to be collated, strcmp() is used to compare the corresponding char sequences: string A comes before (or is less than) string B if strcmp(A,B) < 0. (Note: this is not ambiguous; the C99 standard specifies that strcmp() interprets the bytes that make up a string as unsigned.)

- Attribute names and channel names passed to the IlmImf library must be encoded as UTF-8 and in Normalization Form C. (Note: this last rule could be changed to: Attribute names and channel names must be encoded as UTF-8. The library converts the names to Normalization Form C before any further processing.)

Florian
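
To make the lookup problem and the collation rule concrete, here is a minimal sketch in C. It is an illustration under stated assumptions, not IlmImf code: the helper names (names_equal, name_less, show_example) are hypothetical, and the example syllable U+D55C is chosen for this sketch only; it is not the name from my earlier mail. The sketch spells the same channel name twice, once as a precomposed Hangul syllable and once as three Jamo, and shows that strcmp() treats the two as different names. It performs no normalization itself; that would need a Unicode library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Example name for this sketch only: Hangul syllable U+D55C, then ".R".
   The syllable is a single precomposed code point, so this spelling is NFC. */
const char NAME_NFC[] = "\xED\x95\x9C" ".R";

/* The same text with the syllable split into the Jamo
   U+1112, U+1161, U+11AB (canonically equivalent, but not NFC). */
const char NAME_JAMO[] = "\xE1\x84\x92" "\xE1\x85\xA1" "\xE1\x86\xAB" ".R";

/* Hypothetical helper: returns 1 if the two names are byte-for-byte
   identical, which is all that a strcmp()-based table lookup can see. */
int names_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* Hypothetical helper: returns 1 if a collates before b.
   Only the sign of the strcmp() result is tested, because the C
   standard guarantees the sign but not a particular value such as -1. */
int name_less(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) < 0;
}

/* Prints the comparison results for the two spellings. */
void show_example(void)
{
    printf("spellings compare equal: %d\n",
           names_equal(NAME_NFC, NAME_JAMO));
    printf("Jamo spelling sorts first: %d\n",
           name_less(NAME_JAMO, NAME_NFC));
}
```

Calling show_example() from main() prints both results. Because the two spellings differ at the byte level, a channel list keyed on raw bytes could hold both as separate entries, which is the situation the NFC rule is meant to exclude.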