[issue18814] Add tools for "cleaning" surrogate escaped strings

Nick Coghlan Sun, 24 Aug 2014 06:00:51 -0700

Nick Coghlan added the comment:

The redecode thing is a distraction from my core concern here, so I've split 
that out to issue #22264, a separate RFE for a "wsgiref.fix_encoding" function.


For this issue, my main concern is the function to *clean* a string of escaped 
binary data, so it can be displayed easily, or otherwise purged of the escaped 
characters. Preserving the data by default is good, but you have to know a 
*lot* about how Python 3 works in order to figure out how to clean it 
out.
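
To make the problem concrete, here is a minimal sketch of how escaped binary data ends up in a string and why it later gets in the way. The byte string is an invented example, not taken from the issue:

```python
# Hypothetical input: Latin-1 encoded bytes that get decoded as ASCII
raw = b"caf\xe9"
text = raw.decode("ascii", errors="surrogateescape")

# The undecodable byte 0xE9 is preserved as the lone surrogate U+DCE9
assert text == "caf\udce9"

# Encoding with the same error handler restores the original bytes
assert text.encode("ascii", errors="surrogateescape") == raw

# A strict encode (for display, logging, UTF-8 output) rejects the surrogate
strict_encode_failed = False
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_failed = True
```

Nothing in the string itself advertises that it carries escaped bytes; the failure only shows up at the point where the string is strictly encoded.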

For that, not knowing Unicode in general isn't the problem: it's not knowing 
PEP 383. If we forget the idea of exposing the constant with the escaped values 
(I agree that's not very useful), it suggests "codecs.clean_surrogate_escapes" 
as a possible name:


    import re

    # Helper to ensure a string contains no escaped surrogates
    # This allows it to be safely encoded without surrogateescape
    _extended_ascii = bytes(range(128, 256))
    # Bytes 0x80-0xFF decode to the lone surrogates U+DC80-U+DCFF
    _escaped_surrogates = _extended_ascii.decode('ascii',
                                    errors='surrogateescape')
    _match_escaped = re.compile('[{}]'.format(_escaped_surrogates))
    def clean_surrogate_escapes(s, repl='\ufffd'):
        return _match_escaped.sub(repl, s)

A more efficient implementation in C would also be fine; this is just an easy 
way to define the exact semantics.
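
As a rough illustration of those semantics, the sketch below repeats the helper so it runs on its own and applies it to an invented input string. The sample bytes and the variable names `dirty`, `cleaned`, and `stripped` are assumptions made for the example:

```python
import re

# Same definition as the proposed helper, repeated so this runs standalone
_escaped_surrogates = bytes(range(128, 256)).decode("ascii",
                                                    errors="surrogateescape")
_match_escaped = re.compile("[{}]".format(_escaped_surrogates))

def clean_surrogate_escapes(s, repl="\ufffd"):
    return _match_escaped.sub(repl, s)

# Hypothetical string carrying two escaped bytes (0xE9 and 0xFF)
dirty = b"caf\xe9 \xff!".decode("ascii", errors="surrogateescape")

# Default: each escaped surrogate becomes U+FFFD REPLACEMENT CHARACTER
cleaned = clean_surrogate_escapes(dirty)

# Passing an empty replacement purges the escaped characters entirely
stripped = clean_surrogate_escapes(dirty, repl="")

# The cleaned string can now be encoded without surrogateescape
encoded = cleaned.encode("utf-8")
```

Note that cleaning is lossy by design: once the surrogates are replaced, the original bytes can no longer be recovered from the string.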

(I also just noticed that unlike other error handlers, surrogateescape and 
surrogatepass do not have corresponding codecs.surrogateescape_errors and 
codecs.surrogatepass_errors functions.)
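
One way to check this on a given interpreter is sketched below. The result may differ between Python versions, so treat it as a probe rather than a statement about any particular release; the `report` dictionary is just a name chosen for the example:

```python
import codecs

# Error handler names to probe
names = ["strict", "ignore", "replace", "xmlcharrefreplace",
         "backslashreplace", "surrogateescape", "surrogatepass"]

report = {}
for name in names:
    # Every registered handler can be fetched by name
    handler = codecs.lookup_error(name)
    # Only some handlers are also exposed as codecs.<name>_errors
    report[name] = (callable(handler), hasattr(codecs, name + "_errors"))

for name, (registered, has_function) in report.items():
    print(name, registered, has_function)
```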

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________