Michael Foord writes:

 > When reading text files the presence of the UTF-8 signature *almost
 > invariably* means a UTF-8 encoding. Honouring this will almost always
 > be better than using the wrong encoding. Of course there are caveats,
 > but it will be a substantial improvement.
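For concreteness, honouring the signature amounts to roughly the
following sketch (the helper name and the latin-1 fallback are my own,
not part of any proposal; the real stdlib spelling of the signature-
stripping codec is 'utf-8-sig'):

    import codecs

    def open_sniffed(path, fallback='latin-1'):
        # Peek at the first three bytes; the UTF-8 signature is EF BB BF.
        with open(path, 'rb') as f:
            head = f.read(3)
        if head.startswith(codecs.BOM_UTF8):
            # 'utf-8-sig' decodes UTF-8 and strips the signature.
            return open(path, encoding='utf-8-sig')
        return open(path, encoding=fallback)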
Sure, that would be better than using the wrong encoding *if* the only
thing that matters is getting the input codec right. But it's not clear
that it's an improvement from the naive programmer's point of view,
which needs to take into account the behavior of the whole application.
Is it an improvement if it "seems to work" in testing, and then munges
something important to the boss because she has a correspondent who
uses UTF-8, not UTF-8-signature? Maybe it's better if it screws up
almost all the time, so that the problem is detected early!

 > Unless you keep the information about the original encoding along with
 > the decoded string, changing the (default) output encoding depending
 > on the input is simply not possible - and so not really relevant.

That's throwing the baby out with the bathwater. Very few practical
applications that care about the input encoding are going to be willing
to accept an output encoding that doesn't correspond to the input
encoding in an appropriate way. *If* you are going to advocate guessing
about the input encoding, even based on very strong signals like the
UTF-8 signature, then you really have to advocate adding the
infrastructure to ensure that the output encoding is properly set. If
the output encoding is the programmer's problem, then it's purely
pandering to laziness not to ask them to deal with the input encoding
as well.
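One way such infrastructure could look: simply keep the detected codec
alongside the decoded string, so output can be written in a
corresponding encoding. A minimal sketch (the names and the latin-1
fallback are mine):

    import codecs

    def read_text(path):
        # Return the decoded text *and* the codec actually used,
        # so the caller can round-trip it on output.
        with open(path, 'rb') as f:
            raw = f.read()
        if raw.startswith(codecs.BOM_UTF8):
            return raw.decode('utf-8-sig'), 'utf-8-sig'
        return raw.decode('latin-1'), 'latin-1'

    text, enc = read_text('in.txt')
    # Writing with 'utf-8-sig' re-emits the signature, so a correspondent
    # who sent a signature gets one back; others never see one.
    with open('out.txt', 'w', encoding=enc) as f:
        f.write(text)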