Nick Coghlan added the comment:
As RDM noted, avoiding the use of surrogateescape isn't feasible when we do it
by default on all OS interfaces (including the standard streams when we detect
'ascii' as the filesystem encoding in 3.5+).
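For readers who have not met escaped surrogates before, here is a minimal sketch of how they appear and why they cause trouble later. It decodes with an explicit 'utf-8' codec rather than the real filesystem encoding so the result does not depend on the platform; the sample bytes are made up for illustration.

```python
# Latin-1 encoded bytes that are not valid UTF-8 (illustrative sample data).
raw = b"caf\xe9"

# Decoding with surrogateescape never fails: the undecodable byte 0xE9
# is smuggled through as the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# A strict encode later on (for example when writing to a UTF-8 file)
# rejects the lone surrogate.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = exc
else:
    strict_error = None

# Encoding with surrogateescape restores the original bytes exactly.
round_tripped = text.encode("utf-8", "surrogateescape")
```

The string is only safe to hand back to OS interfaces; anything that encodes it strictly will raise.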
This *needs* to be a case that folks can handle without needing to spend years
learning about encodings and error handlers first. That means being able to
tell them "use this documented function to remove the surrogates" rather than
"use this magic incantation that you don't understand, and that other people
may not be able to read".
I know more about Unicode encodings than the average programmer at this point,
yet I still needed to be schooled by true experts in this thread to learn how
to solve the problem properly.
Look at this as an opportunity to encapsulate that knowledge in executable
form: the code is short, but it is conceptually *very* dense.
If there's a dedicated function, then replacing the encode/decode dance with a
faster pure C alternative also becomes a future possibility (with only a
recipe, there's no opportunity to ever optimise it).
With the additional clarification, it is also clear to me that Antoine is
correct that the encoding needs to be configurable and should default to the
appropriate setting to remove the surrogates from OS provided data.
With that change:

    import sys

    def convert_surrogates(data, encoding=None, errors='replace'):
        """Convert escaped surrogates by applying a different error handler.

        If no encoding is given, defaults to sys.getfilesystemencoding().
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        if encoding is None:
            encoding = sys.getfilesystemencoding()
        return data.encode(encoding, 'surrogateescape').decode(encoding, errors)

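As a rough illustration of how the proposed function would behave, the sketch below repeats its body so that it runs on its own, and passes 'utf-8' explicitly so the results do not depend on the local filesystem encoding. The function is only a proposal in this thread, not an existing stdlib API, and the sample string is invented.

```python
import sys


def convert_surrogates(data, encoding=None, errors='replace'):
    # Same logic as the proposal in this message, repeated so the
    # example is self-contained.
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return data.encode(encoding, 'surrogateescape').decode(encoding, errors)


# A string carrying one escaped byte (0xE9 became U+DCE9).
escaped = b'caf\xe9'.decode('utf-8', 'surrogateescape')

# Default handler: the bad byte becomes U+FFFD REPLACEMENT CHARACTER.
replaced = convert_surrogates(escaped, 'utf-8')

# Any other decoding error handler can be chosen instead.
ignored = convert_surrogates(escaped, 'utf-8', errors='ignore')

# Strings without escaped surrogates pass through unchanged.
untouched = convert_surrogates('plain text', 'utf-8')
```

After conversion the result can be encoded strictly, which is the point of the exercise.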
Since it's primarily intended for cleaning OS-provided data, I agree
os.convert_surrogates() could be a good choice. It would be appropriate to
reference it from os.fsdecode() as a way to clean escaped data when the
original binary data is no longer available to be decoded again with a
different error handler.
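The os.fsdecode() case can be sketched as follows. This assumes a POSIX system, where os.fsdecode() uses the surrogateescape handler; clean_name() is a hypothetical helper that inlines the proposed logic, and the file name is invented. What the cleaned name looks like depends on the local filesystem encoding, so no specific output is claimed.

```python
import os
import sys


def clean_name(name):
    # Hypothetical helper: re-encode with surrogateescape to recover the
    # original bytes, then decode again with the 'replace' handler.
    encoding = sys.getfilesystemencoding()
    return name.encode(encoding, 'surrogateescape').decode(encoding, 'replace')


# Simulate a name that arrived as str from an OS interface; from here on
# only the str is available, not the original bytes.
name = os.fsdecode(b'report-\xff.txt')

# Safe to log, display, or encode strictly.
display_name = clean_name(name)
utf8_bytes = display_name.encode('utf-8')
```

The original `name` should still be the one passed back to os.open() and friends, since the cleaned version may no longer identify the file.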
----------
title: Add codecs.convert_surrogateescape to "clean" surrogate escaped strings
-> Add a convert_surrogates function to "clean" surrogate escaped strings
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue18814>
_______________________________________