Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:
> Nevertheless, there are important abstractions that are written on top
> of the bytes layer, and in the Unix and Linux world, the most
> important abstraction is *text*. In the Unix world, text formats and
> text processing is much more common in user-space apps than binary
That Linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers (Unicode code points).
Linux text is a sequence of 8-bit integers (bytes).
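A minimal sketch of the difference, using only built-in str and bytes.
The variable names are mine, chosen for illustration:

```python
# The same phrase seen as Python text (code points) and as UTF-8 bytes.
text = "Hyvää yötä"

code_points = [ord(ch) for ch in text]   # one integer per character
utf8_bytes = list(text.encode("utf-8"))  # one integer per byte

print(len(code_points))  # 10
print(len(utf8_bytes))   # 14

# "ä" is a single code point but two bytes in UTF-8.
print(hex(ord("ä")))                             # 0xe4
print([oct(b) for b in "ä".encode("utf-8")])     # ['0o303', '0o244']
```

The octal pair 303 244 is the UTF-8 byte sequence for "ä".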
It is great that lots of computer-to-computer formats are encoded in
ASCII (or, roughly, UTF-8). However, nowhere in Linux is there a real
abstraction layer that processes Python-esque text.
Case in point:
$ env | grep UTF
$ od -c <<<"Hyvää yötä" # "Good night" in Finnish
0000000 H y v 303 244 303 244 y 303 266 t 303 244 \n
The "od" utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
raw bytes, shown in octal. The same goes for "wc" and "tr":
$ wc -c <<<"Hyvää yötä"
$ tr 'ä' 'a' <<<"Hyvää yötä"
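For comparison, here is a rough Python sketch of what byte-oriented
tools see. The helper od_c_style is hypothetical and only imitates the
look of "od -c". The translate step assumes a "tr" that works on bytes
and pads the second set with its last character, which is my
understanding of GNU tr, not something shown above:

```python
# What a byte-oriented tool is handed: UTF-8 bytes plus the newline
# that the here-string appends.
data = "Hyvää yötä\n".encode("utf-8")

def od_c_style(raw):
    """Render each byte roughly the way "od -c" does (hypothetical helper)."""
    cells = []
    for byte in raw:
        if byte == 0x0A:
            cells.append("\\n")
        elif 0x20 <= byte < 0x7F:
            cells.append(chr(byte))            # printable ASCII as-is
        else:
            cells.append(format(byte, "03o"))  # everything else in octal
    return cells

print(od_c_style(data))
print(len(data))  # byte count, newline included: 15

# Byte-wise translation: 'ä' is the two bytes 303 244, so each of them
# is mapped to 'a' independently (assumed tr-like behaviour).
table = bytes.maketrans("ä".encode("utf-8"), b"aa")
translated = data.translate(table)
print(translated)

# Character-wise translation on Python text, for contrast.
print("Hyvää yötä".replace("ä", "a"))  # Hyvaa yöta
```

Note how the byte-wise version also damages "ö", because it shares the
lead byte 303 with "ä".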
Grep is smarter:
$ grep v...y <<<"Hyvää yötä"
which is why you should always prefix "grep" with LC_ALL=C in your
scripts (makes it far faster, too).
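The effect of the locale on that pattern can be imitated in Python by
matching once against str and once against bytes. This is only an
analogy using the standard re module, not a description of how grep is
implemented:

```python
import re

text = "Hyvää yötä"
data = text.encode("utf-8")

# On text, "." matches one character: v, ä, ä, space, y.
char_match = re.search(r"v...y", text)
print(char_match.group(0) if char_match else None)  # vää y

# On bytes, "." matches one byte. Five bytes sit between "v" and "y"
# (two per "ä" plus the space), so three dots are not enough.
byte_match = re.search(rb"v...y", data)
print(byte_match)  # None

# Five dots are needed to span the same stretch byte-wise.
print(re.search(rb"v.....y", data) is not None)  # True
```

So a pattern written with characters in mind can silently stop matching
(or match something else) once the tool is told to treat input as bytes.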