Steven D'Aprano writes:
 > On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
 > > On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
 > > > * Default encoding is "utf-8".
 > >
 > > it might be worthwhile to be a little more sophisticated than this.
 > >
 > > Notepad itself uses character set detection [it might not be
 > > reasonable to do this on the whole file as notepad does, but maybe the
 > > first 512 bytes, or the result of read1(512)?] when opening a file of
 > > unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at
 > > least detect at the presence of UTF-8 and UTF-16 BOMs [and treat the
 > > file as UTF-16 in the latter case].
 >
 > I like Random's idea. If we add a new "open text file" builtin
 > function, we should seriously consider having it attempt to
 > auto-detect the encoding. It need not be as sophisticated as
 > `chardet`.

It definitely should not be as sophisticated as chardet. Detection of
the ISO 8859, ISO 2022, and EUC families of encodings is reliable as
long as you know that only one member of each family is going to be
used. But you cannot easily tell which member of the ISO 8859 family
(or of the Windows-12xx code pages) is present, and the same goes for
the other families.

I see very little use in detecting the BOMs. I haven't seen a UTF-16
BOM in the wild in a decade (as usual for me, that's Japan-specific,
and may be limited to the academic community as well), and the UTF-8
BOM is a no-op if the default is UTF-8 anyway.

I'm definitely leaning toward the suggestion I made elsewhere (if it's
adopted at all): force UTF-8, and name it 'open_utf8'.

Steve
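
To make the two options in this thread concrete, here is a minimal sketch, not a proposed implementation. The names `sniff_bom` and `open_utf8` are hypothetical helpers written for illustration (only the name 'open_utf8' comes from the message above), and the sketch assumes BOM detection means nothing more than looking at the first few bytes of the file. UTF-32 BOMs are deliberately ignored.

```python
import codecs

# Hypothetical helpers for illustration only; neither is an existing
# or accepted stdlib API.

# BOM -> codec name. "utf-8-sig" and "utf-16" both consume the BOM
# while decoding; "utf-16" also uses it to pick the byte order.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(path, default="utf-8"):
    """Return a codec name based only on a leading BOM, else `default`.

    No statistical detection is attempted (nothing like chardet).
    """
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


def open_utf8(path, mode="r", **kwargs):
    """Open a text file as UTF-8 with no detection at all."""
    if "b" in mode:
        raise ValueError("open_utf8 is for text mode only")
    return open(path, mode, encoding="utf-8", **kwargs)
```

A possible use of the detecting variant is `open(path, encoding=sniff_bom(path))`. One detail to keep in mind when comparing the two: Python's plain "utf-8" codec does not drop a leading BOM but decodes it as U+FEFF at the start of the text, whereas "utf-8-sig" strips it, so a forced-UTF-8 opener as sketched here would hand that character through to the caller.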