Thanks for the detailed explanation.

On Wed, Nov 14, 2012 at 9:11 PM, Florian Kainz <ka...@ilm.com> wrote:
> David Aguilar wrote:
>>
>> Is it not easier to treat the data like raw bytes and not care?
>>
>> I'm in favor of UTF-8 as a recommendation.
>> I'm on the fence about enforcing it in the library (it couldn't hurt).
>> I am not overly excited about pushing normalization issues into the
>> library.
>>
>> What's the driving benefit of forcing a particular normalization?
>>
>> The user used a particular form. Why not use it as-is?
>> Presumably the rest of their app uses it too, so leaving data as-is
>> lets them make the call.
>
>
> I'm not sure that treating the data as raw bytes and not caring is a
> good idea.
>
> Suppose someone hands you an OpenEXR file, and a listing of the header
> reveals the following set of channels:
>
>     공룡.R
>     공룡.G
>     공룡.B
>     배경.R
>     배경.G
>     배경.B
>
> In your image processing application you want to extract the first layer
> from the file, so you type 공룡. However, you don't know - and you
> shouldn't have to know - how the text is encoded in the file: Hangul Jamo,
> Hangul syllables (pre-composed Jamo) or a combination of both. In order
> to access the correct channel, the name in the file and the name that
> was typed in must both be converted into a common, canonical encoding.
> Unicode normalization does that.
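To make the quoted example concrete, here is a minimal sketch in Python (not the OpenEXR library, which is C++) using the standard `unicodedata` module. The helper name `canonical` and the choice of NFC as the common form are assumptions made for illustration; the thread does not say which normalization form would be used.

```python
import unicodedata


def canonical(name: str) -> str:
    """Return the NFC form of a name (NFC is an assumption of this sketch)."""
    return unicodedata.normalize("NFC", name)


# The layer name typed as precomposed Hangul syllables (공룡).
typed = "\uacf5\ub8e1"

# The same text as it might be stored in a file: decomposed into
# conjoining Jamo. NFD is used here only to produce that variant.
stored = unicodedata.normalize("NFD", typed)

# The two strings differ code point for code point ...
print(typed == stored)

# ... but compare equal once both are converted to the same form.
print(canonical(typed) == canonical(stored))
```

A raw byte comparison would treat `typed` and `stored` as different names, which is the lookup failure described in the quoted text.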
That makes sense. This is probably the most common use case,
so I see how it helps here. In lieu of an encoding header,
one form must be chosen, so it's best to go with one.

I just wanted to illustrate one tiny use case where not doing
auto-normalization could be helpful.

Just thinking out loud --
Auto-normalization definitely makes sense for channel and header names.
Are there any use cases for raw const char * storage? Header values?

> Similarly, if the file already contains a channel called 배경.R, encoded
> using Jamo, then it should not be possible to add another channel with
> the name 배경.R, but encoded as syllables. Code might not have a problem
> distinguishing the two channel names, but people certainly would. The
> OpenEXR library should detect an attempt to add two channels with the
> same name, and generate an appropriate error message.
>
> The fact that storing a string in a file and retrieving it may change its
> encoding should not be a big problem for application code that is aware of
> Unicode, since the application must already be able to handle alternate
> encodings of a string.
>
> Instead of normalizing strings before they are stored in files, the
> OpenEXR library could normalize strings on the fly before every string
> comparison. That way every string would be preserved exactly. Speed
> could be an issue, though. String comparisons are not rare, and on-the-fly
> normalization would slow them down considerably.

--
David

_______________________________________________
Openexr-devel mailing list
Openexr-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/openexr-devel
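The two strategies discussed in the quoted text (normalize once when a name is stored, or normalize on the fly at every comparison) can be sketched in Python as follows. `ChannelNames`, `nfc`, and `same_name` are hypothetical names invented for this illustration and are not part of the OpenEXR API; NFC is likewise an assumed choice of form.

```python
import unicodedata


def nfc(name: str) -> str:
    """Normalize a name to NFC (assumed form for this sketch)."""
    return unicodedata.normalize("NFC", name)


def same_name(a: str, b: str) -> bool:
    """On-the-fly strategy: normalize both sides at every comparison.
    Stored strings stay untouched, but each comparison pays the cost."""
    return nfc(a) == nfc(b)


class ChannelNames:
    """Normalize-on-store strategy (hypothetical container, not OpenEXR).
    Names are normalized once when added, so later lookups are plain
    string comparisons, but the original encoding is not preserved."""

    def __init__(self) -> None:
        self._names: set[str] = set()

    def add(self, name: str) -> None:
        key = nfc(name)
        if key in self._names:
            raise ValueError(f"channel {name!r} already exists")
        self._names.add(key)

    def __contains__(self, name: str) -> bool:
        return nfc(name) in self._names


# 배경.R as precomposed syllables and as conjoining Jamo.
syllables = "\ubc30\uacbd.R"
jamo = unicodedata.normalize("NFD", syllables)

channels = ChannelNames()
channels.add(jamo)

try:
    channels.add(syllables)  # same name, different encoding
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

With normalize-on-store, the second `add` is rejected even though the two strings differ at the code point level. `same_name` gives the same answer without altering either string, at the cost of normalizing on each call.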