On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
> On 8/28/2014 12:30 AM, MRAB wrote:
> > On 2014-08-28 05:56, Glenn Linderman wrote:
> >> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
> >>> Glenn Linderman writes:
> >>>  > On 8/26/2014 4:31 AM, MRAB wrote:
> >>>  > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
> >>>  > >> Nick Coghlan writes:
> >>>  >
> >>>  > > How about:
> >>>  > >
> >>>  > >     replace_surrogate_escapes(s, replacement='\uFFFD')
> >>>  > >
> >>>  > > If you want them removed, just pass an empty string as the
> >>>  > > replacement.
> >>>
> >>> That seems better to me (I had too much C for breakfast, I think).
> >>>
> >>>  > And further, replacement could be a vector of 128 characters, to do
> >>>  > immediate transcoding,
> >>>
> >>> Using what encoding?
> >>
> >> The vector would contain the transcoding. Each lone surrogate would map
> >> to a character in the vector.
> >>
> >>> If you knew that much, why didn't you use
> >>> (write, if necessary) an appropriate codec? I can't envision this
> >>> being useful.
> >>
> >> If the data format describes its encoding, possibly containing data from
> >> several encodings in various spots, then perhaps it is best read as
> >> binary, and processed as binary until those definitions are found.
> >>
> >> But an alternative would be to read with surrogate escapes, and then
> >> when the encoding is determined, to transcode the data. Previously, a
> >> proposal was made to reverse the surrogate escapes to the original
> >> bytes, and then apply the (now known) appropriate codec. There are not
> >> appropriate codecs that can convert directly from surrogate escapes to
> >> the desired end result. This technique could be used instead, for
> >> single-byte, non-escaped encodings. On the other hand, writing specialty
> >> codecs for the purpose would be more general.
> >>
> > There'll be a surrogate escape if a byte couldn't be decoded, but just
> > because a byte could be decoded, it doesn't mean that it's correct.
> >
> > If you picked the wrong encoding, the other codepoints could be wrong
> > too.
>
> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
> is also pretty useless to "replace_surrogate_escapes" at all, because it
> only cleans out the non-decodable characters, not the incorrectly
> decoded characters.

Well, replace would still be useful for ASCII+surrogateescape. Also for
cases where the data stream is *supposed* to be in a given encoding, but
contains undecodable bytes. Showing the stuff that incorrectly decodes
as whatever it decodes to is generally what you want in that case.

--David
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
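
For readers following the thread, here is a minimal sketch of the helper being discussed. The name replace_surrogate_escapes and its replacement parameter come from MRAB's suggestion quoted above; the regex-based body is only an assumption for illustration and is not an existing standard-library function. The sketch shows the ASCII+surrogateescape case David mentions, and also MRAB's point that a wrong codec can decode bytes "successfully" and leave nothing for the helper to find.

```python
import re

# The surrogateescape error handler maps each undecodable byte 0x80..0xFF
# to a lone surrogate in the range U+DC80..U+DCFF.
_SURROGATE_ESCAPES = re.compile('[\udc80-\udcff]')


def replace_surrogate_escapes(s, replacement='\ufffd'):
    """Hypothetical helper: replace every surrogate escape in s.

    Pass an empty string as replacement to remove the escapes instead.
    """
    # A function is used so that backslashes in replacement are not
    # interpreted as regex group references.
    return _SURROGATE_ESCAPES.sub(lambda match: replacement, s)


# Case 1: ASCII + surrogateescape. Byte 0xE9 is not ASCII, so it is
# smuggled through as the lone surrogate U+DCE9.
raw = b'caf\xe9 ok'
text = raw.decode('ascii', errors='surrogateescape')
cleaned = replace_surrogate_escapes(text)        # 'caf\ufffd ok'
removed = replace_surrogate_escapes(text, '')    # 'caf ok'

# The escapes are reversible as long as they have not been replaced.
roundtrip = text.encode('ascii', errors='surrogateescape')

# Case 2: the wrong codec decodes without error. Latin-1 accepts every
# byte, so UTF-8 data read as Latin-1 yields mojibake and no escapes.
utf8_bytes = '\xe9'.encode('utf-8')              # b'\xc3\xa9'
mis = utf8_bytes.decode('latin-1', errors='surrogateescape')
unchanged = replace_surrogate_escapes(mis)       # same as mis
```

In the second case the helper has nothing to replace, which is the limitation Glenn acknowledges; in the first case the output marks exactly the bytes that could not be decoded.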