Patches item #1313939, was opened at 2005-10-05 17:01
Message generated for change (Comment added) made by loewis
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1313939&group_id=5470

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: Library (Lib)
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: Nobody/Anonymous (nobody)
Summary: Speedup PyUnicode_DecodeCharmap

Initial Comment:
This patch speeds up PyUnicode_DecodeCharmap() as discussed in the thread:
http://mail.python.org/pipermail/python-dev/2005-October/056958.html

It makes it possible to pass a unicode string to PyUnicode_DecodeCharmap() in addition to the dictionary, which is still supported. The unicode character at position i in the string is used as the decoded value for byte i. Byte values greater than the length of the string and u"\ufffd" characters in the string are treated as "maps to undefined".

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2005-10-05 20:36

Message:
Logged In: YES
user_id=21627

For decoding, Walter's code is nearly identical to the fastmap decoder: both use a Py_UNICODE array to represent the map, and both use REPLACEMENT CHARACTER to denote a missing target code.

I find the use of U+FFFD highly appropriate, and not at all debatable. None of the existing codecs maps any of its characters to U+FFFD, and I would consider it a bug if one did. REPLACEMENT CHARACTER should only be used if there is no appropriate character, so no charmap should claim that the appropriate mapping for some byte is that character.

That you often use U+FFFD in output to denote unmappable characters is a different issue; indeed, Python's "replace" mode does so. It would continue to do so under this patch.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-05 19:50

Message:
Logged In: YES
user_id=38388

The patch looks good, but I'd still like to see whether Hye-Shik's fastmap codec wouldn't be a better and more general solution, since it seems to also provide good performance for encoding Unicode strings.

That said, you should use a non-code point such as 0xFFFE for meaning "undefined mapping". The Unicode replacement character is not a good choice, as this is a very valid character which often actually gets used to replace characters for which no Unicode code point is known.

----------------------------------------------------------------------

_______________________________________________
Patches mailing list
[email protected]
http://mail.python.org/mailman/listinfo/patches
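The string-table lookup described in the initial comment can be modeled roughly as follows. This is only a pure-Python sketch written for Python 3 (so the table is a str and the input is bytes), not the C code from the patch. The function name decode_charmap_sketch and the REPLACEMENT constant are hypothetical names chosen for illustration, and the handling of the "ignore" and "replace" modes is an assumption based on how Python's standard error handlers behave. A byte at or beyond the end of the table is treated as undefined here.

```python
REPLACEMENT = "\ufffd"  # U+FFFD in the table means "maps to undefined"


def decode_charmap_sketch(data: bytes, table: str, errors: str = "strict") -> str:
    """Hypothetical model: table[i] is the decoded character for byte value i."""
    out = []
    for pos, byte in enumerate(data):
        # A byte past the end of the table has no entry, so it is undefined.
        ch = table[byte] if byte < len(table) else REPLACEMENT
        if ch == REPLACEMENT:
            if errors == "strict":
                raise UnicodeDecodeError(
                    "charmap", data, pos, pos + 1,
                    "character maps to <undefined>")
            if errors == "ignore":
                continue  # drop the undecodable byte
            if errors != "replace":
                raise ValueError("unsupported error mode: %r" % errors)
            # "replace" mode falls through and emits U+FFFD in the output
        out.append(ch)
    return "".join(out)
```

With a three-character table "abc", the bytes 0, 1 and 2 decode to "abc", while byte 5 is undefined: it raises in "strict" mode and yields U+FFFD in "replace" mode. The latter is the output-side use of U+FFFD that Martin's comment distinguishes from its role as a marker inside the table.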
