Re: Python 3.2 has some deadly infection

Marko Rauhamaa Thu, 05 Jun 2014 09:57:48 -0700

Steven D'Aprano:

> Nevertheless, there are important abstractions that are written on top
> of the bytes layer, and in the Unix and Linux world, the most
> important abstraction is *text*. In the Unix world, text formats and
> text processing is much more common in user-space apps than binary
> processing.


That Linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers (Unicode code points).
Linux text is a sequence of 8-bit integers (bytes).
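
To make the distinction concrete, here is a minimal Python 3 sketch that
views the same phrase both ways. The variable names are mine, chosen for
illustration only, and the non-ASCII letters are written as escapes so
the snippet does not depend on the encoding of the source file:

```python
# The phrase "Hyvää yötä", with ä as U+00E4 and ö as U+00F6.
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"

# Python's view: one integer (code point) per character.
code_points = [ord(ch) for ch in text]

# The byte-level view: one integer per UTF-8 byte.
utf8_bytes = list(text.encode("utf-8"))

print(len(code_points))  # 10 characters
print(len(utf8_bytes))   # 14 bytes, since each ä/ö takes two bytes
```

The two lengths differ because each of the four non-ASCII letters
occupies two bytes in UTF-8 but only one position in a Python str.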

It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in Linux is there a real abstraction
layer that processes Python-esque text.

Case in point:

   $ env | grep UTF
   LANG=en_US.UTF-8
   $ od -c <<<"Hyvää yötä"     # "Good night" in Finnish
   0000000   H   y   v 303 244 303 244       y 303 266   t 303 244  \n
   0000017

The "od" utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
bytes.

How about:

   $ wc -c <<<"Hyvää yötä"
   15
   $ tr 'ä' 'a' <<<"Hyvää yötä"
   Hyvaaaa ya�taa
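
Both results can be reproduced in Python 3 by working on bytes instead
of str. This is only a sketch of what I assume "tr" is doing here
(translating byte by byte, with the one-byte second set padded to the
length of the two-byte first set); it is not taken from the tool's
source:

```python
# The here-string as the tools see it: UTF-8 bytes plus a newline.
data = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n".encode("utf-8")

# "wc -c" counts bytes, not characters.
byte_count = len(data)                  # 15
char_count = len(data.decode("utf-8"))  # 11, newline included

# Assumed byte-wise "tr": 'ä' is the two bytes C3 A4, and each of
# them is mapped to 'a' independently.
table = bytes.maketrans("\u00e4".encode("utf-8"), b"aa")
bytewise = data.translate(table)

# A character-wise translation, for comparison.
charwise = data.decode("utf-8").replace("\u00e4", "a")

print(byte_count, char_count)
print(bytewise)
print(charwise, end="")
```

In the byte-wise result the first byte of "ö" (also C3) turns into "a"
while its second byte is left stranded, which is the invalid character
in the output above.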

Grep is smarter:

   $ grep v...y <<<"Hyvää yötä"
   Hyvää yötä

Here each "." matched a whole character rather than a single byte,
which is why you should always prefix "grep" with LC_ALL=C in your
scripts (makes it far faster, too).
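
The same difference shows up in Python's "re" module, depending on
whether the pattern is applied to str or to bytes. A minimal sketch
(it is an analogy for the two grep modes, not a claim about how grep is
implemented):

```python
import re

text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"
data = text.encode("utf-8")

# On str, "." matches one character, so "v...y" spans "vää y".
str_match = re.search(r"v...y", text)

# On bytes, "." matches one byte; "ää " is five bytes, so the
# three-dot pattern finds nothing and a five-dot pattern is needed.
bytes_match = re.search(rb"v...y", data)
bytes_match_5 = re.search(rb"v.....y", data)

print(str_match is not None)      # True
print(bytes_match is not None)    # False
print(bytes_match_5 is not None)  # True
```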


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
