Hi,

Thanks for all the answers! I will try to sum up all the ideas here.

(1) Change default open() behaviour or make it optional?

Guido would like to add an option and keep open() unchanged. He wrote
that checking for a BOM and using the system locale are too different to
share the same option value (encoding=None). Antoine would like to check
for a BOM by default, because for him both options (system locale vs.
checking for a BOM) are the same thing.

About Antoine's choice (encoding=None): which file modes would check for
a BOM? I would like to answer "only the read-only mode", but then
open(filename, "r") and open(filename, "r+") would behave differently.

=> 1 point for Guido (encoding="BOM" is more explicit)

Antoine's choice has the advantage of directly supporting UTF-8+BOM,
UTF-16 and UTF-32 in all applications and all modules using
open(filename).

=> 1 point for Antoine

(2) Check for a BOM while reading or detect it before?

Everybody agrees that checking for a BOM is an interesting option and
should not be limited to open().

Marc-Andre proposed a codecs.guess_file_encoding() function accepting a
file name or a binary file-like object: it returns the encoding and
seeks to the file start or to just after the BOM.

I dislike this function because it requires extra file operations
(open() (optional), read() and seek()) and it doesn't work if the file
is not seekable (e.g. a pipe). I prefer to check for a BOM at the first
read in TextIOWrapper to avoid extra file operations.

Note: I implemented the BOM check in TextIOWrapper, so it's already
usable for any file-like object.

(3) tell() and seek() on a text file starting with a BOM

To be consistent with Antoine's example:

   >>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
   >>> f = io.TextIOWrapper(bio, encoding='utf-16')
   >>> f.read()
   'ab'
   >>> f.seek(0)
   0
   >>> f.read()
   'ab'

TextIOWrapper:

 * tell() should return zero at the file start,
 * seek(0) should go back to the file start,
 * and the BOM should always be "ignored".
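
For illustration only, here is a minimal sketch of a BOM check that
needs no seek(), together with the tell() / seek(0) behaviour listed
above. It is not the patch discussed in this thread: detect_bom() and
open_bom() are hypothetical names, the UTF-8 default is an assumption,
and the sketch relies only on the existing "utf-8-sig", "utf-16" and
"utf-32" codecs, which already skip the BOM when decoding.

```python
import codecs
import io

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones, because
# BOM_UTF32_LE begins with the same two bytes as BOM_UTF16_LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_bom(head, default="utf-8"):
    """Hypothetical helper: map the BOM at the start of `head` to a codec name."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default  # assumption: fall back to UTF-8 when there is no BOM


def open_bom(buffered, default="utf-8"):
    """Hypothetical helper: wrap a BufferedReader, choosing the codec from its BOM.

    peek() does not move the stream position, so no seek() is needed.
    It may return fewer than 4 bytes (e.g. on a pipe); a real
    implementation would have to handle that case.
    """
    head = buffered.peek(4)[:4]
    return io.TextIOWrapper(buffered, encoding=detect_bom(head, default))


# Same bytes as in Antoine's example: UTF-16 little-endian BOM, then "ab".
buffered = io.BufferedReader(io.BytesIO(b'\xff\xfea\x00b\x00'))
f = open_bom(buffered)
start = f.tell()    # position before anything has been read
first = f.read()    # decoded text, without the BOM
f.seek(0)
second = f.read()   # reading again from the start gives the same text
```

The codec name is chosen before the wrapper is created, so the decoder
itself consumes the BOM and tell() / seek() keep working on plain byte
offsets.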

I mean that the following should hold:

   with open("utf8bom.txt", encoding="BOM") as fp:
       assert fp.tell() == 0
       text = fp.read()   # no BOM here
       fp.seek(0)
       assert fp.read() == text

--

About my patch:

 - the BOM check is explicit: open(filename, encoding="BOM")
 - tell() / seek(0) work as expected
 - BOM bytes are always skipped in the read() / readlines() result

Hum, I don't know if this email can be called a summary ;-)

-- 
Victor Stinner
http://www.haypocalc.com/
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev