Nick Coghlan added the comment:

I think moving this forward mainly needs someone with the time and energy to 
wrangle a python-ideas/dev discussion to get some additional feedback on the 
API design. As I see it, there are 2 main questions to be resolved:

1. Where to expose these functions

The default location would be the codecs module, as they're closely related to 
the error handlers in that module, and the main reasons for needing to clean 
data at all are handling dirty data produced by an interface that uses 
surrogatepass or surrogateescape when decoding (process_surrogates, 
process_surrogateescape), or encoding data for use in a context which doesn't 
correctly handle code points outside the basic multilingual plane 
(process_astrals).
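
As a rough illustration of where such dirty data comes from, the snippet below 
uses only the existing "surrogateescape" and "surrogatepass" error handlers. 
The sample byte strings and variable names are made up for the example and are 
not part of any proposal.

```python
# Bytes that are valid Latin-1 but not valid UTF-8 (illustrative sample).
raw = b"caf\xe9"

# "surrogateescape" smuggles each undecodable byte into the string as a
# lone surrogate in the U+DC80..U+DCFF range.
escaped = raw.decode("utf-8", errors="surrogateescape")
assert escaped == "caf\udce9"

# The same handler restores the original bytes when encoding.
assert escaped.encode("utf-8", errors="surrogateescape") == raw

# A strict encode rejects the lone surrogate, which is why the text may
# need cleaning before it is handed to other code.
try:
    escaped.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_failed = True
else:
    strict_encode_failed = False

# "surrogatepass" lets an encoded surrogate through when decoding.
passed = b"\xed\xa0\x80".decode("utf-8", errors="surrogatepass")
assert passed == "\ud800"

# An astral code point is anything above U+FFFF.
astral = "\U0001F600"
assert ord(astral) > 0xFFFF
```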

If added to the codecs module, they could be documented in new sections on 
"Postprocessing decoded text" and "Preprocessing text for encoding".

The main argument against that would be Stephen's: these aren't themselves 
encoding or decoding operations, but rather internal state manipulations on 
Python strings.

2. The exact function set to be provided.

The three potential data cleaning cases currently being considered are:

* process_surrogates: reprocessing all surrogates in the string, including lone 
surrogates and valid surrogate pairs. Such strings may be produced by using the 
"surrogatepass" handler when decoding, or by decomposing astral characters to 
surrogate pairs.
* process_surrogateescape: reprocessing only lone surrogates in the U+DC80 to 
U+DCFF range, with other surrogate pairs or lone surrogates triggering 
UnicodeTranslateError. Such strings may be produced by using the 
"surrogateescape" error handler when decoding.
* process_astrals: reprocessing all code points in the astral planes (that is, 
everything outside the basic multilingual plane).
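
To make the three cases concrete, here is a minimal pure-Python sketch. The 
signatures are hypothetical: nothing above settles what "reprocessing" should 
do, so the single `replacement` parameter (defaulting to U+FFFD) is purely an 
assumption for illustration, as is replacing each half of a surrogate pair 
separately.

```python
def _is_surrogate(code_point):
    # Surrogate code points occupy U+D800..U+DFFF.
    return 0xD800 <= code_point <= 0xDFFF


def process_surrogates(text, replacement="\ufffd"):
    """Replace every surrogate code point, lone or paired (assumed policy)."""
    return "".join(
        replacement if _is_surrogate(ord(ch)) else ch for ch in text
    )


def process_surrogateescape(text, replacement="\ufffd"):
    """Replace only U+DC80..U+DCFF; reject any other surrogate."""
    out = []
    for index, ch in enumerate(text):
        code_point = ord(ch)
        if 0xDC80 <= code_point <= 0xDCFF:
            out.append(replacement)
        elif _is_surrogate(code_point):
            raise UnicodeTranslateError(
                text, index, index + 1,
                "surrogate outside the surrogateescape range",
            )
        else:
            out.append(ch)
    return "".join(out)


def process_astrals(text, replacement="\ufffd"):
    """Replace every code point above U+FFFF (assumed policy)."""
    return "".join(replacement if ord(ch) > 0xFFFF else ch for ch in text)
```

With these definitions, process_surrogateescape("caf\udce9") gives 
"caf\ufffd", while process_surrogateescape("\ud800") raises 
UnicodeTranslateError.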

These seem to cover the essentials to me, and I changed the proposed prefix to 
"process_*" based on the idea of documenting them as preprocessing and 
postprocessing steps for encoding and decoding.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________