New submission from Nathaniel Smith <n...@pobox.com>:

cPickle.dump by default does not properly encode unicode characters outside the BMP -- it throws away the high bits:

>>> cPickle.loads(cPickle.dumps(u"\U00012345"))
u'\u2345'

The problem is in dump, not load:

>>> pickle.dumps(u"\U00012345")  # works
'V\\U00012345\np0\n.'
>>> cPickle.dumps(u"\U00012345")  # no!
'V\\u2345\n.'

It does work correctly when using a more modern pickling protocol:

>>> cPickle.loads(cPickle.dumps(u"\U00012345", 1))
u'\U00012345'
>>> cPickle.loads(cPickle.dumps(u"\U00012345", 2))
u'\U00012345'

But this is not much comfort for users whose data has been corrupted because they went with the defaults. (Fortunately in my application I knew that all my characters were in the supplementary plane, so I could repair the data after the fact, but...)

Above tests are with 2.5.2, but from checking the source, the bug is obviously still present in 2.6.1: cPickle.c:modified_EncodeRawUnicodeEscape has no code to handle 32-bit unicode values. OTOH, it does look like someone noticed the problem and fixed it for 3.0; _pickle.c:raw_unicode_escape handles such characters fine. Guess they just forgot to backport the fixes... but the code is there, and can probably just be copy-pasted back to 2.6.

----------
components: Library (Lib)
messages: 78230
nosy: njs
severity: normal
status: open
title: cPickle corrupts high-unicode strings
versions: Python 2.5, Python 2.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4730>
_______________________________________
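
For illustration only, the truncation described in this report can be modelled in pure Python. This is a rough sketch, not the code from cPickle.c or _pickle.c. The function names are hypothetical, and it is written for Python 3 (where every str holds full code points) rather than the 2.x versions the report concerns. It contrasts an escaper that always emits a 4-digit \u escape with one that switches to an 8-digit \U escape above the BMP.

```python
# Hypothetical model of the reported behaviour; not the actual C implementation.

def escape_truncating(text):
    # Always emits a 4-digit escape, so code points above 0xFFFF
    # lose their high bits (the symptom shown in the report).
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF or ch in "\\\n":
            out.append("\\u%04x" % (cp & 0xFFFF))
        else:
            out.append(ch)
    return "".join(out).encode("latin-1")


def escape_full_range(text):
    # Uses an 8-digit escape for code points outside the BMP,
    # and a 4-digit escape for everything else that needs escaping.
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            out.append("\\U%08x" % cp)
        elif cp > 0xFF or ch in "\\\n":
            out.append("\\u%04x" % cp)
        else:
            out.append(ch)
    return "".join(out).encode("latin-1")


def decode(data):
    # The standard raw-unicode-escape codec understands both escape forms.
    return data.decode("raw-unicode-escape")


if __name__ == "__main__":
    original = "\U00012345"
    print(escape_truncating(original), decode(escape_truncating(original)) == original)
    print(escape_full_range(original), decode(escape_full_range(original)) == original)
```

In this model the truncating escaper round-trips to U+2345, matching the corrupted result in the report, while the full-range escaper round-trips to the original character.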