On approximately 1/8/2010 5:12 PM, came the following characters from
the keyboard of MRAB:
> Glenn Linderman wrote:
>> On approximately 1/8/2010 3:59 PM, came the following characters from
>> the keyboard of Victor Stinner:
>>> Hi,
>>> Thanks for all the answers! I will try to sum up all ideas here.
>>
>> One concern I have with this implementation of encoding="BOM" is that
>> if there is no BOM it assumes UTF-8. That is probably a good
>> assumption in some circumstances, but not in others.
>>
>> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
>> encoded files include a BOM. It is only required that UTF-16 and
>> UTF-32 (cases where the endianness is unspecified) contain a BOM.
>> Hence, someone might expect UTF-16LE (or any of the formats that
>> don't require a BOM) rather than UTF-8 as the default, but be
>> willing to accept any BOM-discriminated format.
>>
>> * Potentially, this could be expanded beyond the various Unicode
>> encodings... one could envision that a program whose data files were
>> historically in some particular national language locale might want
>> to be enhanced to accept Unicode, and could declare that it will
>> accept any BOM-discriminated format, but wants to default, in the
>> absence of a BOM, to the original national language locale it
>> historically accepted. That would provide a migration path for its
>> old data files.
>>
>> So the point is that it might be nice to have
>> "BOM-otherEncodingForDefault" for each other encoding that Python
>> supports. I am not sure that is the right API, but I think it is
>> expressive enough to handle the cases above. Whether the cases solve
>> actual problems or not, I couldn't say, but they seem like reasonable
>> cases.
>>
>> It would, of course, be nicest if OS metadata had been invented way
>> back when, for all OSes, such that all text files were flagged with
>> their encoding... then languages could just read the encoding and do
>> the right thing! But we live in the real world, instead.
>
> What about listing the possible encodings? It would try each in turn
> until it found one where the BOM matched or had no BOM:
>
>     my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
>
> or is that taking it too far?
That sounds very flexible -- but the net effect would only be to make
illegal a subset of the BOM-containing encodings (those not listed),
without making legal any encodings beyond the single non-BOM encoding.
Whether prohibiting a subset of BOM-containing encodings is a useful use
case, I couldn't say... but my goal would be to accept as many different
file encodings on input as possible: without a BOM, that is exactly 1
(unless there are other heuristics); with a BOM, it is 1 + all
BOM-containing encodings. Your scheme would permit the number of
accepted encodings to vary between 1 and 1 + all BOM-containing
encodings.
(I think everyone can agree there are 5 different byte sequences that
can be called a Unicode BOM. The likelihood of them appearing in any
other text encoding created by mankind depends on those other encodings
-- but it is not impossible. It is truly up to the application to
decide whether BOM detection could potentially conflict with files in
some other encoding that would be acceptable to the application.)
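For reference, those five byte sequences are already available as
constants in the codecs module:

```python
import codecs

# The five Unicode BOM byte sequences (UTF-8 has one signature;
# UTF-16 and UTF-32 each have an LE and a BE form).
assert codecs.BOM_UTF8     == b'\xef\xbb\xbf'
assert codecs.BOM_UTF16_LE == b'\xff\xfe'
assert codecs.BOM_UTF16_BE == b'\xfe\xff'
assert codecs.BOM_UTF32_LE == b'\xff\xfe\x00\x00'
assert codecs.BOM_UTF32_BE == b'\x00\x00\xfe\xff'
```

Note that the UTF-32-LE BOM starts with the UTF-16-LE BOM bytes, so a
sniffer has to test the longer sequences first.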
So I think it takes things further than I can see value in, but I'm
willing to be convinced otherwise. I see only a need for detecting a
BOM, and for specifying a default encoding to be used if there is no
BOM. Note that it might also be nice to have a way to request the
current encoding=None heuristic as that default -- perhaps
encoding="BOM-None" per my originally proposed syntax. But I'm still not
saying that is the best syntax.
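To make "detect a BOM, else use a stated default" concrete, here is a
minimal sketch (open_bom and its table are hypothetical names of mine,
standing in for whatever encoding="BOM-<default>" would do; the default
argument could equally be a legacy locale encoding, per the migration
case above):

```python
import codecs

# Longer BOMs are checked first because the UTF-32-LE BOM starts with
# the UTF-16-LE BOM bytes. The 'utf-16'/'utf-32'/'utf-8-sig' codecs
# all consume the BOM themselves when decoding.
_BOM_TO_ENCODING = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8,     'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def open_bom(filename, default='utf-8'):
    """Open a text file using its BOM if present, else `default`."""
    with open(filename, 'rb') as f:
        prefix = f.read(4)       # the longest BOM is 4 bytes
    for bom, enc in _BOM_TO_ENCODING:
        if prefix.startswith(bom):
            return open(filename, 'r', encoding=enc)
    return open(filename, 'r', encoding=default)
```

So open_bom(name, default='latin-1') would accept any BOM-marked Unicode
file while still reading historical Latin-1 data files unchanged.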
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev