On 2 June 2014 11:14, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>
> On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:
> > I'm currently working on a product that interacts with lots of other
> > products. These other products can be using any encoding - but most of
> > the functions that interact with I/O assume the system default encoding
> > of the machine that is collecting the data. The product has been in
> > production for nearly a decade, so there's a lot of pushback against
> > changes deep in the code for fear that it will break working systems.
> > The fact that they are working largely by accident appears to escape
> > them ...
> > FWIW, changing to use iso-latin-1 by default would be the most sensible
> > option (effectively treating everything as bytes), with the option for
> > another encoding if/when more information is known (e.g. there's often a
> > call to return the encoding, and the output of that call is guaranteed
> > to be ASCII).
> Python 2 does what you suggest, and it is *broken*. Python 2.7 creates
> moji-bake, while Python 3 gets it right:
The purpose of my example was to show a case where no thought was put into
encodings - the assumption was that the system encoding and the remote
system encoding would be the same. This is most definitely not the case a
lot of the time.
I also should have been more clear that *in the particular situation I was
talking about* iso-latin-1 as default would be the right thing to do, not
in the general case. Quite often we won't know the correct encoding until
we've executed a command via ssh - iso-latin-1 will allow us to extract the
info we need (which will generally be 7-bit ASCII) without the possibility
of an invalid encoding. Sure we may get mojibake, but that's better than
the alternative when we don't yet know the correct encoding.
> Latin-1 is one of those legacy encodings which needs to die, not to be
> entrenched as the default. My terminal uses UTF-8 by default (as it
> should), and if I use the terminal to input "δжç", Python ought to see
> what I input, not Latin-1 moji-bake.
For some purposes, there needs to be a way to treat an arbitrary stream of
bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a
convenient way to do that. It's not the only way, but settling on it and
being consistent is better than not having a way.