Ian Bicking:
> I think the use case everyone has in mind here is where
> you get a URL from one of these sources, and you want to handle it. I have
> a hard time imagining the sequence of events that would lead to mojibake.
> Naive parsing of a document in bytes couldn't do it, because if you have a
> non-ASCII-compatible document your ASCII-based parsing will also fail (e.g.,
> looking for b'href="(.*?)"').

It depends on what the particular ASCII-based parsing is doing. For example, the set of trail bytes in Shift-JIS includes the byte values of some ASCII punctuation characters as well as all the ASCII letters. A search or split on '@' or '|' may match the trail byte of a two-byte character rather than a true occurrence of that character, so the operation 'succeeds' but produces an incorrect result.

Over time, the set of trail bytes in use has expanded: in GB18030, digits are possible. Many of the most important characters for parsing, such as ''' "#%&.?/''', are still safe because they cannot be trail bytes in the common double-byte character sets.

Neil
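
A minimal sketch of the trail-byte problem described above, assuming Python 3 and its built-in shift_jis and gb18030 codecs. The sample string and variable names are made up for illustration; they are not taken from any real URL-handling code.

```python
# Hypothetical sample: U+30A1 (KATAKANA LETTER SMALL A) encodes in Shift-JIS
# as the two bytes 0x83 0x40, and 0x40 is also the ASCII code for '@'.
text = "\u30a1@example"
raw = text.encode("shift_jis")

# Naive byte-level split: it 'succeeds' but cuts the two-byte character in half.
naive_parts = raw.split(b"@")

# Decoding first and splitting the text gives the intended result.
decoded_parts = raw.decode("shift_jis").split("@")

# In GB18030, the second and fourth bytes of a four-byte sequence are ASCII digits.
four_byte = "\U00010000".encode("gb18030")
digit_positions = [i for i, b in enumerate(four_byte) if 0x30 <= b <= 0x39]

print(naive_parts)      # three pieces, the first is a lone lead byte
print(decoded_parts)    # two pieces, as intended
print(digit_positions)  # byte offsets within the character that look like digits
```

Splitting after decoding is only one way around the problem; a byte-level parser that tracks lead and trail bytes would also work, at the cost of knowing the encoding's structure.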