David J Birnbaum wrote:
> Dear Python-list,
>
> I need to read a Unicode (utf-8) file that contains text like:
>
> blah \fR40\fC blah
>
> I get my input and then process it with something like:
>
> inputFile = codecs.open(sys.argv[1],'r', 'utf-8')
>
> for line in inputFile:
>
> When Python encounters the "\f" substring in an input line, it wants to
> treat it as an escape sequence representing a form-feed control
> character,

Even if it were as sentient as "wanting" to muck about with the input,
it doesn't. Those escape sequences are interpreted by the compiler when
it processes string literals in your source code, and by some functions
(e.g. re.compile), but *not* when reading a text file. Example:

|>>> guff = r"blah \fR40\fC blah"
|>>> print repr(guff)
|'blah \\fR40\\fC blah'
|>>> # above is ASCII so it is automatically also UTF8

Comment: It contains a backslash followed by 'f' ...

|... fname = "guff.utf8"
|>>> f = open(fname, "w")
|>>> f.write(guff)
|>>> f.close()
|>>> import codecs
|>>> f = codecs.open(fname,'r', 'utf-8')
|>>> guff2 = f.read()
|>>> print guff2 == guff
|True

No interpretation of the r"\f" has been done.

> which means that it gets interpreted as (or, from my
> perspective, translated to) "\x0c". Were I entering this string myself
> within my program code, I could use a raw string (r"\f") to avoid this
> translation, but I don't know how to do this when I am reading a line
> from a file.

What I suggest you do is:

print repr(open('yourfile', 'r').read())

[or at least one of the offending lines] and inspect it closely. You may
find (1) that the file has formfeeds in it, or (2) it has r"\f" in it
and you were mistaken about the interpretation, or (3) something else.
If you maintain (3) is the case, then make up a small example file, show
a dump of it using print repr(.....) as above, plus the (short) code
where you decode it and dump the result.

HTH,
John
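
P.S. The same check as a standalone script, in case that is easier to run than an interactive session. This is only a minimal sketch: the sample text, the temporary file, and the variable names are made up for illustration, and it is written so that it should also run on Python 3. It writes a line containing a literal backslash followed by 'f', reads it back with codecs.open the way the original code does, and tests which of the two possibilities is actually in the decoded line.

```python
import codecs
import os
import tempfile

# Hypothetical sample line: a literal backslash followed by 'f',
# as in the text quoted in the question.
sample = u"blah \\fR40\\fC blah\n"

# Write it out as UTF-8 to a throwaway file (path invented for this sketch).
fd, fname = tempfile.mkstemp(suffix=".utf8")
os.close(fd)
with codecs.open(fname, "w", "utf-8") as f:
    f.write(sample)

# Read it back the same way the code in the question does.
with codecs.open(fname, "r", "utf-8") as f:
    lines = [line for line in f]
os.remove(fname)

line = lines[0]
has_backslash_f = u"\\f" in line  # two characters: backslash, then 'f'
has_formfeed = u"\x0c" in line    # one character: the form-feed control

print(repr(line))
print("backslash + f present: %s" % has_backslash_f)
print("form feed present:     %s" % has_formfeed)
```

If reading really did translate escapes, has_formfeed would come out True. Here has_backslash_f is True and has_formfeed is False, which matches the interactive session above. Running the same two membership tests on a line from your real file should tell you whether case (1) or case (2) applies.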