On Dec 3, 8:10 am, Michael Goerz <[EMAIL PROTECTED]> wrote:
> MonkeeSage wrote:
> > On Dec 3, 1:31 am, MonkeeSage <[EMAIL PROTECTED]> wrote:
> >> On Dec 2, 11:46 pm, Michael Spencer <[EMAIL PROTECTED]> wrote:
> >
> >>> Michael Goerz wrote:
> >>>> Hi,
> >>>> I am writing unicode strings into a special text file that requires to
> >>>> have non-ascii characters as octal-escaped UTF-8 codes.
> >>>> For example, the letter "Í" (latin capital I with acute, code point 205)
> >>>> would come out as "\303\215".
> >>>> I will also have to read back from the file later on and convert the
> >>>> escaped characters back into a unicode string.
> >>>> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> >>>> vice versa?
> >>> Perhaps something along the lines of:
> >>>
> >>> >>> def encode(source):
> >>> ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
> >>> ...
> >>> >>> def decode(encoded):
> >>> ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
> >>> ...     return bytes.decode('utf8')
> >>> ...
> >>> >>> encode(u"Í")
> >>> '\\303\\215'
> >>> >>> print decode(_)
> >>> Í
> >>>
> >>> HTH
> >>> Michael
> >> Nice one. :) If I might suggest a slight variation to handle cases
> >> where the "encoded" string contains plain text as well as octal
> >> escapes...
> >
> >> def decode(encoded):
> >>     for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
> >>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
> >>     return encoded.decode('utf8')
> >
> >> This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
> >> as well as "adf\\303\\215adf".
> >
> >> Regards,
> >> Jordan
> >
> > err...
> >
> > def decode(encoded):
> >     for octc in re.findall(r'\\(\d{3})', encoded):
> >         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
> >     return encoded.decode('utf8')
>
> Great suggestions from both of you! I came up with my "final" solution
> based on them. It encodes only non-ascii and non-printables, and stays
> in unicode strings for both input and output. Also, low ascii values now
> encode into a 3-digit octal sequence, so that decode can catch them
> properly.
>
> Thanks a lot,
> Michael
>
> ____________
>
> import re
>
> def encode(source):
>     encoded = ""
>     for character in source:
>         if (ord(character) < 32) or (ord(character) > 128):
>             for byte in character.encode('utf8'):
>                 encoded += ("\%03o" % ord(byte))
>         else:
>             encoded += character
>     return encoded.decode('utf-8')
>
> def decode(encoded):
>     decoded = encoded.encode('utf-8')
>     for octc in re.findall(r'\\(\d{3})', decoded):
>         decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return decoded.decode('utf8')
>
> orig = u"blaÍblub" + chr(10)
> enc = encode(orig)
> dec = decode(enc)
> print orig
> print enc
> print dec

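For anyone reading this on Python 3, where str is already unicode and iterating over bytes yields ints, the encoder quoted above might look roughly like the sketch below. The name encode_octal is made up for the example, and it is not a drop-in copy: it escapes everything from 127 up, so DEL and U+0080 get escaped too, which the "> 128" test in the quoted version lets through.

```python
def encode_octal(source):
    """Escape control and non-ASCII characters as 3-digit octal UTF-8 bytes."""
    pieces = []
    for character in source:
        code = ord(character)
        if code < 32 or code >= 127:
            # In Python 3, iterating over bytes yields ints, so no ord() here.
            pieces.extend("\\%03o" % byte for byte in character.encode("utf-8"))
        else:
            pieces.append(character)
    return "".join(pieces)


print(encode_octal("blaÍblub\n"))  # prints bla\303\215blub\012
```

Collecting the pieces in a list and joining once avoids the repeated string concatenation. A literal backslash in the input is still passed through unescaped, as in the quoted version.
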
An optimization: in decode(), store the matches as keys in a dict, so
you only do the string replacement once for each unique escape sequence...

import re

def decode(encoded):
    decoded = encoded.encode('utf-8')
    # Collect each distinct octal escape once; the dict keys act as a set.
    matches = {}
    for octc in re.findall(r'\\(\d{3})', decoded):
        matches[octc] = None
    # One replace() call per distinct escape, not one per occurrence.
    for octc in matches:
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

Untested...

Regards,
Jordan
--
http://mail.python.org/mailman/listinfo/python-list
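
P.S. Another way to avoid the repeated replace() calls, sketched here for Python 3 and not the Python 2 used in this thread, is to let re.sub do the whole job in one pass. The names OCTAL_ESCAPE and decode_octal are made up for the example. The pattern is deliberately stricter than \d{3}: it only accepts octal values up to 377 (one byte), which is my assumption about what the file format allows.

```python
import re

# A backslash followed by three octal digits whose value fits in one byte.
OCTAL_ESCAPE = re.compile(br"\\([0-3][0-7]{2})")


def decode_octal(encoded):
    """Turn backslash-octal escapes back into bytes, then decode as UTF-8."""
    raw = encoded.encode("utf-8")
    # Single pass: each match is replaced by the one byte it names.
    raw = OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")


print(decode_octal("adf\\303\\215adf"))  # prints adfÍadf
```

Because the substitution happens in one pass, bytes produced by an earlier replacement are never rescanned, so an escaped backslash followed by digits cannot be mistaken for a new escape.
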