Marc-Andre Lemburg <m...@egenix.com> added the comment:

Thanks for the patch, Victor.
Some comments on the patch:

* the codec will have to be able to work with lone surrogates (see the Wikipedia page explaining this detail), which the UTF-8 codec in Python 3.x no longer does, so another special case is due for this difference

* we should not make the standard UTF-8 codec slower just to support a variant of UTF-8 which will only get marginal use; for the decoder, the changes are minimal, so that's fine, but for the encoder you are changing the most often used code branch to check for NUL bytes - we need a better solution for this, even if it means having to use a separate encode_utf8java function

Since the ticket was opened in 2008, the common name of the codec appears to have changed from "UTF-8 Java" to "Modified UTF-8", with "MUTF-8" as short alias:

* http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 (change in http://en.wikipedia.org/w/index.php?title=UTF-8&diff=next&oldid=291829304)

* http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ (scroll down to "Modified UTF-8")

* http://developer.android.com/reference/java/io/DataInput.html (this is for Android)

So I guess we should adapt the name to the now common name and call it "ModifiedUTF8" in the C API and add these aliases: "utf-8-modified", "mutf-8" and "modified-utf-8".

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2857>
_______________________________________
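
For readers unfamiliar with the format, here is a rough pure-Python sketch of the two differences discussed above: NUL is written as the two-byte sequence C0 80, and non-BMP code points are written as a surrogate pair with each half encoded as three bytes (which also means lone surrogates pass through). The names encode_mutf8 and _three_bytes are made up for illustration; this is not the patch under review, not the proposed C API, and it makes no attempt at the performance concerns raised in the comment.

```python
def _three_bytes(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 style bytes."""
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def encode_mutf8(text: str) -> bytes:
    """Hypothetical helper: encode a str as Modified UTF-8 (illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL is never written as a single 0x00 byte.
            out += b"\xc0\x80"
        elif cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            # Includes lone surrogates U+D800..U+DFFF, which the strict
            # UTF-8 codec in Python 3.x rejects.
            out += _three_bytes(cp)
        else:
            # Non-BMP: split into a UTF-16 surrogate pair and encode
            # each half separately as three bytes.
            cp -= 0x10000
            out += _three_bytes(0xD800 | (cp >> 10))
            out += _three_bytes(0xDC00 | (cp & 0x3FF))
    return bytes(out)
```

For BMP text without NUL characters or surrogates the output is identical to the standard UTF-8 encoding, so only the NUL and non-BMP branches are actually "modified".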