On Tue, Mar 22, 2011 at 7:29 PM, Panu Matilainen <pmati...@laiskiainen.org> wrote:
> The bindings cannot go changing header contents to their liking, so any
> canonicalization would have to go into rpm proper, the build-side of things
> to be exact so the runtime doesn't have to care. Requiring rpm to fiddle
> with encodings + canonicalization for every single string it processes at
> runtime would require enormous changes throughout rpm, and presumably at a
> massive performance cost too.

Just a hint from our experience with APIs like os/email/urllib.parse:
you pretty much end up *needing* to have parallel bytes and str APIs
(including higher level data structures that know how to encode and
decode themselves) to get this to work properly. The str APIs will
work 90% of the time, but you still need access to the raw bytes to
recover when the simple approach fails.

One key choice to be made is whether to go the brittle option (i.e.
ASCII) for the implicit decoding, or the permissive one (i.e. UTF-8
with surrogateescape). The former punts on the complicated encoding
issues (e.g. urllib.parse does this, since correctly formed URLs are
meant to be encoded into pure ASCII), while the latter works by
default in more situations, but can allow malformed data to escape the
IO layer and cause problems in other parts of the program (e.g. many
of the os APIs do this, since real world applications often care more
about round tripping correctly between different OS interfaces).

Cheers,
Nick.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
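
To illustrate the brittle versus permissive choice described above, here is a
minimal Python 3 sketch. It is not taken from rpm or its bindings; the byte
string stands in for a hypothetical header value that was written in Latin-1,
and the variable names are made up for the example.

```python
# Hypothetical header value: Latin-1 bytes, not ASCII and not valid UTF-8.
raw = b"caf\xe9.rpm"

# Brittle option: strict ASCII decoding refuses non-ASCII input right away.
try:
    raw.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Permissive option: UTF-8 with surrogateescape does not fail. Each
# undecodable byte becomes a lone surrogate (0xE9 -> U+DCE9), which the same
# error handler turns back into the original byte when encoding.
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")

# The cost: the str now carries malformed data past the IO layer, and a
# strict encode elsewhere in the program will reject it.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

print(strict_ok, repr(text), round_tripped == raw, strict_encode_ok)
```

The str value is convenient for the common case, but only the original bytes
(or a surrogateescape round trip) let the caller recover the exact data when
the implicit decoding guessed wrong, which is why a parallel bytes API tends
to be needed.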