aurora wrote:
> I have some unicode string with some characters encode using python
> notation like '\n' for LF. I need to convert that to the actual LF
> character. There is a 'unicode_escape' codec that seems to suit my purpose.
>
> >>> encoded = u'A\\nA'
> >>> decoded = encoded.decode('unicode_escape')
> >>> print len(decoded)
> 3
>
> Note that both encoded and decoded are unicode string. I'm trying to
> use the builtin codec because I assume it has better performance that
> for me to write pure Python decoding. But I'm not converting between
> byte string and unicode string.
>
> However it runs into problem in some cases.
>
> encoded = u'€\\n€'
> decoded = encoded.decode('unicode_escape')
> Traceback (most recent call last):
>   File "g:\bin\py_repos\mindretrieve\trunk\minds\x.py", line 9, in ?
>     decoded = encoded.decode('unicode_escape')
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in
> position 0: ordinal not in range(128)
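
A note on where that traceback comes from, as far as I can tell: in Python 2, calling .decode() on a unicode object first encodes it to a byte string with the default 'ascii' codec, and that implicit step is what fails on the euro sign, before 'unicode_escape' ever runs. A small sketch that reproduces just that step (the variable names are mine, and it is written to run on Python 2 or 3):

```python
# The string from the failing example: euro sign, backslash, 'n', euro sign.
text = u'\u20ac\\n\u20ac'

try:
    # This is the step Python 2 performs implicitly inside unicode.decode().
    text.encode('ascii')
except UnicodeEncodeError as exc:
    failed_at = exc.start   # index of the first character that cannot be encoded
    message = str(exc)
```

The error is raised for position 0, matching the traceback in the quoted message.
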

Does this do what you want?

>>> u'€\\n€'
u'\x80\\n\x80'
>>> len(u'€\\n€')
4
>>> u'€\\n€'.encode('utf-8').decode('string_escape').decode('utf-8')
u'\x80\n\x80'
>>> len(u'€\\n€'.encode('utf-8').decode('string_escape').decode('utf-8'))
3

Basically, I encode the unicode string to UTF-8 bytes, unescape the bytes by decoding them with the 'string_escape' codec, and then decode the result back into a unicode string. (My console shows the euro sign as u'\x80'; that is how this terminal reads the character, not part of the conversion.)

HTH,

STeVe
-- 
http://mail.python.org/mailman/listinfo/python-list
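
P.S. The same bytes round-trip idea can be wrapped in a helper. This is only a sketch and a variation on the one-liner, not the same code: 'string_escape' exists only in Python 2, so this version swaps in Latin-1 with the 'backslashreplace' error handler followed by 'unicode_escape', which should behave the same on Python 2 and 3. The name unescape is made up for illustration, and odd inputs (for example a literal backslash directly before a non-Latin-1 character) are not handled.

```python
def unescape(text):
    """Expand Python-style backslash escapes (such as \\n) in a unicode string."""
    # Characters outside Latin-1 are written out as \uXXXX escapes;
    # everything else maps to exactly one byte.
    raw = text.encode('latin-1', 'backslashreplace')
    # 'unicode_escape' reads plain bytes as Latin-1 and expands every
    # backslash escape, including the \uXXXX ones produced above.
    return raw.decode('unicode_escape')


decoded = unescape(u'\u20ac\\n\u20ac')
print(len(decoded))
```

For the euro example this gives a three-character string (euro sign, LF, euro sign), the same length as the one-liner produced.
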