Richard Schulman wrote:
> The following program fragment works correctly with an ascii input
> file.
>
> But the file I actually want to process is Unicode (utf-16 encoding).
> The file must be Unicode rather than ASCII or Latin-1 because it
> contains mixed Chinese and English characters.
>
> When I run the program below I get an attribute_count of zero, which
> is incorrect for the input file, which should give a value of fifteen
> or sixteen. In other words, the count function isn't recognizing the
> ", characters in the line being read. Here's the program:
>
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> try:
>     # Skip the first line; make the second available for processing
>     in_file.readline()
>     in_line = in_file.readline()
>     attribute_count = in_line.count('",')
>     print attribute_count
> finally:
>     in_file.close()
>
> Any suggestions?
>
> Richard Schulman
> (For email reply, delete the 'xx' characters)

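You're not detecting the file encoding and then using it when you open the file. As written, the program reads raw bytes, and in utf-16 every ASCII character is paired with a zero byte, so the two-byte sequence ", never occurs in the line and count returns zero.

If you know the file is utf-16le or utf-16be, you need to say so when you open it (in Python 2 that means codecs.open rather than the built-in open). If you don't know, read the file into a string, go through some autodetect logic, and then decode it with the <string>.decode(encoding) method.

A clue: a utf-16 or utf-32 file normally has a BOM (byte order mark) as its first character. Strictly speaking the Unicode standard does not mandate it (unmarked utf-16 is to be read as big-endian), but it is the conventional signature, so check for it first. If the file doesn't have a BOM, try ascii and utf-8, in that order. The first one that decodes without error is almost certainly correct. If neither succeeds, you're on your own in guessing the file encoding.

John Roth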
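
P.S. Here is a minimal sketch of the autodetect logic described above, offered as one possible approach rather than a definitive one. The names guess_encoding and read_text are made up for illustration; the BOM constants and the "utf-16", "utf-32" and "utf-8-sig" codecs come from the standard library, and those three codecs strip the BOM while decoding.

```python
import codecs

# BOM signatures to look for. The UTF-32 LE BOM begins with the same two
# bytes as the UTF-16 LE BOM, so UTF-32 has to be checked before UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw):
    """Return a codec name for the byte string raw, or None if unknown.

    Hypothetical helper: BOM first, then ascii, then utf-8.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    for name in ("ascii", "utf-8"):
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def read_text(path):
    """Read a whole file as bytes and decode it with the guessed encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    encoding = guess_encoding(raw)
    if encoding is None:
        raise ValueError("could not determine the encoding of %r" % path)
    return raw.decode(encoding)
```

With that in place, read_text(path).splitlines()[1] gives the second line as a unicode string, and calling count on it with the ", pattern should find the separators. Reading the whole file at once is an assumption that suits small files; for large ones you would detect the encoding from the first few bytes and then reopen the file with codecs.open.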