On 24/01/2010 14:23, Stephen J. Turnbull wrote:
> Michael Foord writes:
> > This is why I'm keen that by *default* Python should honour the UTF8
> > signature when reading files;
> Unfortunately, your caveat about "a lot of the time it will *seem* to
> work" applies to this as well. The only way that "honoring
> signatures" really works is if Python simply uses the UTF-8 codec on
> input and output by default, regardless of locale. Or perhaps if by
> default Python should error out unless a signature is found.
When reading text files the presence of the UTF-8 signature *almost
invariably* means a UTF-8 encoding. Honouring this will almost always be
better than using the wrong encoding. Of course there are caveats, but
it will be a substantial improvement.
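
A minimal sketch of what honouring the signature on read could look like. The helper name `open_honouring_signature` and its `fallback` parameter are hypothetical, used only for illustration; this is not how Python's `open()` behaves. It relies on the standard `utf-8-sig` codec, which strips the signature when decoding.

```python
import codecs
import locale


def open_honouring_signature(path, fallback=None):
    """Open a text file for reading, preferring UTF-8 if a signature is present.

    Hypothetical helper: `fallback` is the encoding used when no UTF-8
    signature is found (defaults to the locale's preferred encoding).
    """
    if fallback is None:
        fallback = locale.getpreferredencoding(False)
    # Peek at the first three bytes to look for the UTF-8 signature (BOM).
    with open(path, "rb") as raw:
        has_signature = raw.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
    # "utf-8-sig" decodes as UTF-8 and drops the leading signature.
    encoding = "utf-8-sig" if has_signature else fallback
    return open(path, "r", encoding=encoding)
```

Files without a signature still fall back to the default, so the caveats remain; the sketch only covers the case where the signature is there.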
> Autodetection (ie, doing something different depending on the presence
> or absence of the signature) does not really work, because for it to
> work correctly, it needs to imply automatic resetting of the output
> codec as well. So what is your naive programmer supposed to expect
> when writing a cat program? Should the first encoding detected or
> defaulted determine the output codec? The last one? UTF-8 uber
> alles?
Unless you keep the information about the original encoding along with
the decoded string, changing the (default) output encoding depending on
the input is simply not possible - and so not really relevant.
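
To illustrate the point: once bytes are decoded, the resulting string carries no record of the codec that produced it. The sketch below shows that, plus one way the information could be kept alongside the text. `DecodedText` and `decode_remembering` are hypothetical names for illustration, not an existing API.

```python
from typing import NamedTuple

# The same text, encoded two different ways on disk.
utf8_bytes = "caf\u00e9".encode("utf-8")
latin1_bytes = "caf\u00e9".encode("latin-1")
assert utf8_bytes != latin1_bytes

# After decoding, the two strings are indistinguishable: nothing in a str
# says which encoding it came from.
from_utf8 = utf8_bytes.decode("utf-8")
from_latin1 = latin1_bytes.decode("latin-1")
assert from_utf8 == from_latin1


class DecodedText(NamedTuple):
    """Hypothetical pairing of decoded text with its source encoding."""
    text: str
    source_encoding: str


def decode_remembering(data: bytes, encoding: str) -> DecodedText:
    # Keep the encoding next to the text so a writer could choose to reuse it.
    return DecodedText(data.decode(encoding), encoding)
```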
Michael
> Such autodetection *can* be done fairly accurately. After 20 years of
> experimenting, Emacs has it pretty much right. But ... Emacs almost
> never runs without a human watching it. And the code that handles
> this is a mess of special cases and heuristics. Not to mention
> throwing more than a few exceptions in practice. And in practice any
> decisions that need to be made about disambiguating the output codec
> are left up to the user.
> > particularly given that programmers who don't/can't/won't
> > understand encodings are likely to read files without specifying an
> > encoding and a lot of the time it will *seem* to work.
> But that's a different problem. If you want to fix that you should
> require an explicit codec parameter on all text I/O. They'll still
> just memorize the magic incantation and grumble about the extra
> characters they have to type, but they'll have been warned.
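
A rough sketch of what "require an explicit codec parameter" might look like as a wrapper around the built-in `open()`. The name `open_text` is hypothetical; making `encoding` a required keyword-only argument is one possible way to force the choice, not something Python's own `open()` does.

```python
def open_text(path, mode="r", *, encoding, errors="strict"):
    """Open a file for text I/O, refusing to guess the encoding.

    Hypothetical wrapper: `encoding` has no default, so omitting it
    raises TypeError at the call site.
    """
    if "b" in mode:
        raise ValueError("open_text is for text I/O only")
    return open(path, mode, encoding=encoding, errors=errors)
```

With this, `open_text("notes.txt")` fails immediately, while `open_text("notes.txt", encoding="utf-8")` behaves like the built-in.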
--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog