On Tue, Jun 22, 2010 at 1:07 PM, James Y Knight <f...@fuhm.net> wrote:
> The surrogateescape method is a nice workaround for this, but I can't help
> thinking that it might've been better to just treat stuff as
> possibly-invalid-but-probably-utf8 byte-strings from input, through
> processing, to output. It seems kinda too late for that, though: next time
> someone designs a language, they can try that. :)
>

surrogateescape does help a lot, my only problem with it is that it's
out-of-band information. That is, if you have data that went through
data.decode('utf8', 'surrogateescape') you can restore it to bytes or
transcode it to another encoding, but you have to know that it was decoded
specifically that way. And of course if you did have to transcode it (e.g.,
text.encode('utf8', 'surrogateescape').decode('latin1')) then if you had
actually handled the text in any way you may have broken it; you don't
*really* have valid text. A lazier solution feels like it would be easier
and more transparent to work with.

But... I also don't see any major language constraint to having another kind
of string that is bytes+encoding. I think PJE brought up a problem with a
couple coercion aspects.

--
Ian Bicking | http://blog.ianbicking.org
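To make the out-of-band point concrete, here is a minimal Python 3 sketch. The sample byte strings and variable names are made up for illustration; only the 'surrogateescape' error handler and the standard codecs are real. It shows the round trip, the need to remember how the text was decoded, and one way that handling the text before transcoding can break it.

```python
# Sample inputs are assumptions for illustration, not from the thread.
latin1_bytes = b"caf\xe9"      # 'café' in Latin-1; not valid UTF-8
sjis_bytes = b"\x83\x5c"       # one Shift-JIS character whose 2nd byte is '\'

# Undecodable bytes become lone surrogates (0xE9 -> U+DCE9).
text = latin1_bytes.decode("utf8", "surrogateescape")
is_escaped = text == "caf\udce9"

# The original bytes come back only if you encode the same way...
round_trip = text.encode("utf8", "surrogateescape")

# ...and a plain strict encode fails: the str itself does not record
# how it was decoded, so the caller has to know.
try:
    text.encode("utf8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Transcoding after the fact works if the text was left untouched.
transcoded = text.encode("utf8", "surrogateescape").decode("latin1")

# Handling the text first can break it: the trailing byte 0x5C looks
# like an ordinary backslash once decoded, so a harmless-seeming
# replace() splits the two-byte character.
path = sjis_bytes.decode("utf8", "surrogateescape")
handled = path.replace("\\", "/")
damaged = handled.encode("utf8", "surrogateescape")
intact = sjis_bytes.decode("shift_jis")
broken = damaged.decode("shift_jis", "replace")
```

Here `round_trip` equals the original bytes and `transcoded` is the intended text, but `broken` no longer matches `intact`, because the replace() rewrote half of a two-byte character.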
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com