Marc-Andre Lemburg added the comment:

On 23.09.2014 13:12, Nick Coghlan wrote:
>
> Nick Coghlan added the comment:
>
> Draft docstring for that version
>
> def convert_surrogates(data, errors='replace'):
>     """Convert escaped surrogates by applying a different error handler
>
>     Uses the "replace" error handler by default, but any input
>     error handler may be specified.
>     """
>     return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

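For reference, a minimal sketch of how the quoted draft behaves on a simple input. The sample bytes and the variable names raw, text and cleaned are illustrative assumptions, not part of the proposal; only convert_surrogates itself is taken from the quoted draft.

```python
def convert_surrogates(data, errors='replace'):
    # Body copied from the quoted draft.
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

# Illustrative input (assumption): Latin-1 encoded bytes, which are not valid UTF-8.
raw = b'caf\xe9'

# Decoding with "surrogateescape" smuggles the undecodable byte 0xE9
# through as the lone low surrogate U+DCE9.
text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))

# The function restores the byte and decodes again with the new handler,
# so the lone surrogate becomes U+FFFD under "replace".
cleaned = convert_surrogates(text)
print(ascii(cleaned))

# Any other decoding error handler can be passed in, e.g. "ignore".
print(ascii(convert_surrogates(text, 'ignore')))
```
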
Nick, the docstring is not correct. The function does not work on escaped surrogates. Instead, it works on lone surrogates that were used to encode undecodable bytes from some input data.

The longer story goes like this:

The "surrogateescape" error handler in the .decode() call that led up to the data you want this function to take as input will convert undecodable bytes to lone low surrogates. The function then encodes these surrogates back into bytes using UTF-8 (which may well not be the original encoding, as Antoine has already pointed out, but that's not really important for the use case), recreating the undecodable bytes, and then decodes the result again with the UTF-8 codec, using a new error handler.

So in summary, the function is supposed to retroactively apply a different error handler to the input data, undoing the effects of the "surrogateescape" error handler. The name still doesn't match this functionality.

BTW: There's a catch in the approach. The encoding used to decode the original data may well be 'ascii'. Now, if the original input data was in fact UTF-8, the input decoding would have mapped the non-ASCII UTF-8 bytes to lone surrogates. The above function would then turn these back into UTF-8, redecode and get a completely different string back (since the new error handler would not trigger).

I'm not sure whether adding such a small function with so many unclear implications is a good idea. Either it should be made more specific, e.g. be reserved for use on data from input streams with known encoding, or be put into the documentation as an example for people to use and adapt as necessary.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________
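
To make the catch described in this comment concrete, here is a small sketch under assumed inputs: the sample string and the names original, text and result are illustrative only. The data is really UTF-8 but was decoded as 'ascii' with "surrogateescape", so the function silently reconstructs the non-ASCII character instead of applying the requested error handler.

```python
def convert_surrogates(data, errors='replace'):
    # Body copied from the draft quoted in this thread.
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

# Illustrative input (assumption): valid UTF-8 bytes for 'caf\xe9'.
original = 'caf\xe9'.encode('utf-8')   # b'caf\xc3\xa9'

# The input was decoded as ASCII, so both non-ASCII bytes were mapped
# to lone surrogates (U+DCC3 and U+DCA9).
text = original.decode('ascii', 'surrogateescape')
print(ascii(text))

# Re-encoding restores valid UTF-8, so the "replace" handler never fires
# and the two surrogates collapse into the single character U+00E9.
result = convert_surrogates(text)
print(ascii(result))
print('\ufffd' in result)
```

Here the caller asked for replacement of the escaped data but gets a differently decoded string instead, which is why the function seems safest when the original input encoding is known.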