On 24/01/2010 14:23, Stephen J. Turnbull wrote:
Michael Foord writes:

  >  This is why I'm keen that by *default* Python should honour the UTF8
  >  signature when reading files;

Unfortunately, your caveat about "a lot of the time it will *seem* to
work" applies to this as well.  The only way that "honoring
signatures" really works is if Python simply uses the UTF-8 codec on
input and output by default, regardless of locale.  Or perhaps if by
default Python should error out unless a signature is found.

When reading text files the presence of the UTF-8 signature *almost invariably* means a UTF-8 encoding. Honouring this will almost always be better than using the wrong encoding. Of course there are caveats, but it will be a substantial improvement.
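
To make concrete what honouring the signature could look like, here is a minimal sketch. The function names and the `fallback` parameter are invented for this illustration and are not an existing or proposed API. It only checks for the UTF-8 signature and otherwise falls back to whatever encoding would have been used anyway.

```python
import codecs
import locale


def decode_honouring_signature(raw, fallback=None):
    """Decode bytes as UTF-8 if they start with the UTF-8 signature.

    Without a signature, use `fallback` (hypothetical parameter), which
    defaults to the locale's preferred encoding.
    Returns a (text, encoding_used) tuple.
    """
    if fallback is None:
        fallback = locale.getpreferredencoding(False)
    if raw.startswith(codecs.BOM_UTF8):
        # Strip the three signature bytes before decoding.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    return raw.decode(fallback), fallback


def read_text_honouring_signature(path, fallback=None):
    """Read a whole file in binary mode and decode it as above."""
    with open(path, "rb") as f:
        return decode_honouring_signature(f.read(), fallback)
```

A file that starts with the signature is decoded as UTF-8 whatever the locale says. A file without one behaves exactly as it does today, which is where the caveats come in.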


Autodetection (i.e., doing something different depending on the presence
or absence of the signature) does not really work, because for it to
work correctly, it needs to imply automatic resetting of the output
codec as well.  So what is your naive programmer supposed to expect
when writing a cat program?  Should the first encoding detected or
defaulted determine the output codec?  The last one?  UTF-8 uber
alles?
Unless you keep the information about the original encoding along with the decoded string, changing the (default) output encoding depending on the input is simply not possible - and so not really relevant.
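
A rough sketch of what "keeping the information along with the decoded string" would involve for the cat example. Everything here is hypothetical: `DecodedText`, `decode_with_origin`, `cat` and the `policy` names are made up for illustration, and Latin-1 stands in for an arbitrary locale default.

```python
import codecs
from collections import namedtuple

# Hypothetical pairing of decoded text with the encoding it came from.
DecodedText = namedtuple("DecodedText", ["text", "encoding"])


def decode_with_origin(raw, fallback="latin-1"):
    """Decode bytes and remember which encoding was used."""
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
        return DecodedText(text, "utf-8")
    return DecodedText(raw.decode(fallback), fallback)


def cat(chunks, policy="first"):
    """Concatenate byte chunks, choosing the output codec by `policy`.

    policy is "first", "last", or anything else for always-UTF-8.
    Returns (output_bytes, output_encoding).
    """
    decoded = [decode_with_origin(c) for c in chunks]
    if not decoded:
        return b"", "utf-8"
    if policy == "first":
        out = decoded[0].encoding
    elif policy == "last":
        out = decoded[-1].encoding
    else:
        out = "utf-8"
    # May raise UnicodeEncodeError if the text does not fit the chosen codec.
    return "".join(d.text for d in decoded).encode(out), out
```

The plain `str` returned by a normal read carries no such record, so none of these policies can be applied without a wrapper of this kind.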


Michael
Such autodetection *can* be done fairly accurately.  After 20 years of
experimenting, Emacs has it pretty much right.  But ... Emacs almost
never runs without a human watching it.  And the code that handles
this is a mess of special cases and heuristics.  Not to mention
throwing more than a few exceptions in practice.  And in practice any
decisions that need to be made about disambiguating the output codec
are left up to the user.

  >  particularly given that programmers who don't/can't/won't
  >  understand encodings are likely to read files without specifying an
  >  encoding and a lot of the time it will *seem* to work.

But that's a different problem.  If you want to fix that you should
require an explicit codec parameter on all text I/O.  They'll still
just memorize the magic incantation and grumble about the extra
characters they have to type, but they'll have been warned.
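
For illustration only, requiring an explicit codec could look something like the wrapper below. `open_text` is a hypothetical name, not something in the standard library; it makes `encoding` a keyword-only argument with no default, so leaving it out is an immediate error rather than a silent guess.

```python
import io


def open_text(path, mode="r", *, encoding):
    """Open a file in text mode; `encoding` must be given explicitly."""
    if "b" in mode:
        raise ValueError("open_text is for text mode only")
    return io.open(path, mode, encoding=encoding)
```

Calling `open_text("data.txt")` raises `TypeError` before any file is touched, while `open_text("data.txt", encoding="utf-8")` behaves like the built-in `open`.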


--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog
