Bugs item #1377394, was opened at 2005-12-09 22:43
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1377394&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Unicode
Group: Python 2.4
>Status: Closed
Resolution: None
Priority: 5
Submitted By: superwesman (superwesman)
Assigned to: M.-A. Lemburg (lemburg)
Summary: read() / readline() blow up if file has even number of char.

Initial Comment:
Hello,

I am having a problem with the read() and readline() functions. I'm using codecs.open() to open a text file, then using either read() or readline() to get its contents. In Python 2.4.2, if the file has an even number of characters, I get a UnicodeDecodeError. In Python 2.4.1 this works regardless of the character count.

I've pasted below a sample script and the sample text file I was running. This is the command I executed at the Windows 2000 CMD prompt:

python sample.py sample.txt

Again, in 2.4.1, this works fine - in 2.4.2 it breaks when the file-to-be-read has an odd number of characters.

Thanks.
-w

# start: sample.py
import codecs
import sys

print "open the file"
in_file = codecs.open( sys.argv[1], "r", "unicode_internal" )
print "read the file"
the_file = in_file.read()
print "close the file"
in_file.close()
print "done"
# end: sample.py

# start: sample.txt
RESULTHOST=vivaldi
RESULTPORT=a
DB_XML=/test/art/jfw/config/DBList.xml
LOGCHECK_IGNORE=art_actions.txt
# end: sample.txt

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-12-12 14:39

Message:
Logged In: YES
user_id=38388

Closing this bug report as "won't fix" (even though SF seems to have removed this option from the tracker, or at least I don't see it in Firefox).

Removing "unicode_internal" from the docs is not an option: this is a valid encoding, albeit one that depends on the way Python is built.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-12-12 14:30

Message:
Logged In: YES
user_id=89016

With Python 2.4.2 I get the following output on both Linux and Windows:

open the file
read the file
close the file
done

This is totally independent of the type of line feeds in sample.txt or the length of the file (even or odd).

> If it is a valid option (that should only be used
> "Python internally" - not sure what that means)
> then it should perform consistently regardless
> of the number of characters in the file, should it not?

unicode_internal just dumps the data bytes of the Unicode object. This means that (depending on the way Python is compiled) the length of a unicode_internal encoded byte string will always be a multiple of 2 or 4. So a byte string that has an odd number of bytes is clearly broken, and decoding would have the right to complain about that. In 2.4.2 it doesn't, because it's not clear to the StreamReader API whether there's more data available on subsequent calls to read() (and the last odd byte is silently dropped).

BTW, the data read by your script is probably not what you might have expected. On a UCS-2 build the result is:

u'\u2023\u7473\u7261\u3a74\u7320\u6d61\u6c70\u2e65\u7874\u0a74\u4552\u5553\u544c\u4f48\u5453\u763d\u7669\u6c61\u6964\u520a\u5345\u4c55\u5054\u524f\u3d54\u0a61\u4244\u585f\u4c4d\u2f3d\u6574\u7473\u612f\u7472\u6a2f\u7766\u632f\u6e6f\u6966\u2f67\u4244\u694c\u7473\u782e\u6c6d\u4c0a\u474f\u4843\u4345\u5f4b\u4749\u4f4e\u4552\u613d\u7472\u615f\u7463\u6f69\u736e\u742e\u7478'

(or something similar depending on your line feeds).

----------------------------------------------------------------------

Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-12-10 11:57

Message:
Logged In: YES
user_id=1188172

I'd suggest unicode_internal to be removed from the docs.

----------------------------------------------------------------------

Comment By: superwesman (superwesman)
Date: 2005-12-10 00:17

Message:
Logged In: YES
user_id=1401447

I didn't realize that 'unicode_internal' was not a legitimate value to pass into this function. If 'unicode_internal' is not a valid 3rd parameter to codecs.open(), shouldn't that function complain? If it is a valid option (that should only be used "Python internally" - not sure what that means) then it should perform consistently regardless of the number of characters in the file, should it not?

Seems to me that pilot-error uncovered a bug. If this is not a valid choice, then codecs.open() should complain. If it is valid, it should perform consistently, IMHO.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-12-09 23:04

Message:
Logged In: YES
user_id=38388

Why would you want to read a file using the Python internal Unicode encoding (unicode_internal)? This is an encoding that is only used Python internally and should not be used for anything else.

----------------------------------------------------------------------
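
The two effects Walter Dörwald describes in this thread (pairs of ASCII bytes read as single 16-bit code units, and an odd trailing byte that a stream decoder cannot yet call an error) can be reproduced with a rough sketch. It rests on assumptions that are not from the thread: it is written for Python 3, and it uses the "utf-16-le" codec as a stand-in for the in-memory layout of a little-endian UCS-2 build rather than unicode_internal itself. The helper names reinterpret and decode_stream are hypothetical.

```python
import codecs

# ASSUMPTION: "utf-16-le" stands in for the internal layout of a
# little-endian UCS-2 build; unicode_internal itself is not used here.
STAND_IN = "utf-16-le"


def reinterpret(raw: bytes) -> str:
    """Decode raw bytes as 16-bit code units (hypothetical helper)."""
    return raw.decode(STAND_IN)


def decode_stream(raw: bytes, final: bool) -> str:
    """Feed raw bytes to an incremental decoder (hypothetical helper).

    final=False means more data may still arrive on a later call.
    """
    decoder = codecs.getincrementaldecoder(STAND_IN)()
    return decoder.decode(raw, final=final)


# Four ASCII bytes collapse into two code points:
# '#' (0x23) + ' ' (0x20) -> U+2023, 's' (0x73) + 't' (0x74) -> U+7473.
head = reinterpret(b"# st")
print([hex(ord(ch)) for ch in head])

# Five bytes with final=False: the odd trailing byte is held back
# instead of being reported as an error.
partial = decode_stream(b"# sta", final=False)
print(len(partial))

# The same five bytes with final=True are rejected as truncated.
try:
    decode_stream(b"# sta", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True
print(truncated)
```

The first two code points match the start of the UCS-2 result quoted in the thread. The final flag plays the role of the missing information in the StreamReader case: without knowing that the input has ended, a decoder can only buffer the leftover byte.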