On Mon, 02 Jun 2014 12:10:48 +0100, Robin Becker wrote:
> there seems to be an implicit assumption in python land that encoded
> strings are the norm. On virtually every computer I encounter that
> assumption is wrong. The vast majority of bytes in most computers is not
> something that can be easily printed out for humans to read. I suppose
> some clever pythonista can figure out an encoding to read my .o / .so
> etc files, but they are practically meaningless to a unicode program
> today. Same goes for most image formats and media files. Browsers
> routinely encounter mis/un-encoded pages.
If you include image, video and sound files, you are probably correct
that most content of files is binary.
Outside of those three kinds of files, I would expect that *by far* the
single largest kind of file is text. Some text is wrapped in a binary
layer, e.g. .doc, .odt, etc. but an awful lot of it is good old human
readable text, including web pages (html) and XML.
Every programming language I know of defaults to opening files in text
mode rather than binary mode. There may be exceptions, but reading and
writing text is ubiquitous while writing .o and .so files is not.
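
As a small illustration of that default in Python 3 (a sketch only; the scratch file is a hypothetical stand-in created just for this example), open() with no mode argument hands back str, and you only get bytes if you ask for them with "rb":

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Text mode for writing; the encoding is spelled out for reproducibility.
    with open(path, "w", encoding="utf-8") as f:
        f.write("Hello World\n")

    # No mode given: the default is "r", which is text mode.
    with open(path, encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode has to be requested explicitly.
    with open(path, "rb") as f:
        as_bytes = f.read()
finally:
    os.remove(path)

print(type(as_text).__name__, type(as_bytes).__name__)
```

The explicit encoding="utf-8" is my choice for the example; leave it out and Python picks a platform-dependent default.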
> In python I would have preferred for bytes to remain the default io
> mechanism, at least that would allow me to decide if I need any
That implies that you're opening files in binary mode by default. It also
implies that even something as trivial as writing the string "Hello
World" to a file (stdout is a file) is impossible until you've learned
about encodings and know which encoding you need. I really don't think
that's a good plan, for any language, but especially a language like
Python which is intended for beginners as well as experts.
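
To make that concrete, here is a rough sketch using io.BytesIO as a stand-in for a bytes-only stdout (that stand-in is my assumption for illustration, not how sys.stdout is actually built). With a binary stream the caller has to pick an encoding before writing anything; a text layer makes that choice once and accepts str:

```python
import io

# A bytes-only stream: the caller must encode first.
raw = io.BytesIO()
raw.write("Hello World\n".encode("ascii"))  # encoding chosen by the caller

# Writing a str directly to a binary stream is rejected.
error = None
try:
    raw.write("Hello World\n")
except TypeError as exc:
    error = exc

# A text layer over a binary stream: the encoding is decided once, up front.
# newline="\n" disables newline translation so the bytes are predictable.
text = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", newline="\n")
text.write("Hello World\n")
text.flush()
text_bytes = text.buffer.getvalue()

print(raw.getvalue() == text_bytes, type(error).__name__)
```

For plain ASCII text both routes produce the same bytes; the difference is who has to know about encodings before the first write.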
The Python 2 approach, where stdout is binary but tries really hard to
pretend to be a superset of ASCII, is simply broken. It works well for
trivial examples, while breaking in surprising and hard-to-diagnose ways
in others. It violates the Zen ("errors should never pass silently,
unless explicitly silenced"): instead of raising an error, it silently
fails and gives mojibake:
[steve@ando ~]$ python2.7 -c "import sys; sys.stdout.write(u'ñβж\n')"
Changing to print doesn't help:
[steve@ando ~]$ python2.7 -c "print u'ñβж'"
Python 3 works correctly, whether you use print or sys.stdout:
[steve@ando ~]$ python3.3 -c "import sys; sys.stdout.write(u'ñβж\n')"
(although I haven't tested it on Windows).
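
For anyone unsure what mojibake looks like, one common way it arises can be sketched in Python 3 as follows. This is only an illustration under my own assumptions (UTF-8 bytes misread as Latin-1); it is not necessarily what the terminal in the sessions above did:

```python
text = "ñβж"

# Encode with one codec...
utf8_bytes = text.encode("utf-8")

# ...and decode with another. Latin-1 maps every byte to some character,
# so this raises no exception; it just yields the wrong characters.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible only if you know which two codecs were involved.
recovered = garbled.encode("latin-1").decode("utf-8")

# A strict codec that cannot represent the text fails loudly instead.
strict_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    strict_error = exc

print(len(text), len(garbled), recovered == text)
```

The three-character string turns into six wrong characters without any error, which is the silent failure being complained about; the strict ASCII encode at least raises an exception.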