On 04/17/2018 09:18 PM, Cleber Rosa wrote:
> Recently, Avocado has seen a lot of changes brought by the Python 3
> port. One fundamental difference between Python 2 and 3 is under
> the spotlight: how to deal with "text" and "binary" data[1].
>
> It's then important to make it clear where Avocado stands (or is
> headed) when it comes to handling text, binary data and encodings,
> which is the goal of this document.
>
> First, let's review some very basic concepts.
>
> Bytes, the unassuming arrays
> ============================
>
> On both Python 2 and 3, there's "bytes". On Python 2, it's nothing
> but an alias to "str"[2]::
>
>    >>> import sys; sys.version[0]
>    '2'
>    >>> bytes is str
>    True
>
> One of the striking characteristics of "bytes" is that every byte
> counts, that is::
>
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute)
>    2
>
> This is as simple as it gets. The "bytes" type is an "array" of
> bytes.
>
> Also, if it's not clear enough, this sequence of two bytes happens to
> be **one way** to **represent** the "LATIN SMALL LETTER A WITH
> ACUTE"[3] character, as defined by the Unicode standard, in a given
> encoding. Please pause for a moment and let that information settle.
>
> Old habits die hard
> ===================
>
> We, human beings, are used to dealing with text. Developers, being a
> special kind of human being, are used to dealing with *character
> arrays* instead. Those are, or have been for a long time, sequences
> of one-byte characters with a specific (but somewhat implicit)
> meaning.
>
> Many developers will still assume that each byte contains a value that
> maps to the ascii(7) table::
>
>    Oct   Dec   Hex   Char                       Oct   Dec   Hex   Char
>    ────────────────────────────────────────────────────────────────────
>    000   0     00    NUL '\0' (null character)  100   64    40    @
>    001   1     01    SOH (start of heading)     101   65    41    A
>    002   2     02    STX (start of text)        102   66    42    B
>    ...
>    076   62    3E    >                          176   126   7E    ~
>    077   63    3F    ?                          177   127   7F    DEL
>
> Some other developers will assume that ASCII is a thing of the past,
> and each one-byte character means something according to the latin1(7)
> mapping::
>
>    ISO 8859-1 characters
>        The following table displays the characters in ISO 8859-1,
>        which are printable and unlisted in the ascii(7) manual page.
>
>    Oct   Dec   Hex   Char   Description
>    ────────────────────────────────────────────────────────────────
>    240   160   A0           NO-BREAK SPACE
>    241   161   A1     ¡     INVERTED EXCLAMATION MARK
>    242   162   A2     ¢     CENT SIGN
>    ...
>    376   254   FE     þ     LATIN SMALL LETTER THORN
>    377   255   FF     ÿ     LATIN SMALL LETTER Y WITH DIAERESIS
>
> Then, there's yet another group of developers who believe that a byte
> in an array of bytes may be either a character, or part of a
> character. They believe that because Unicode, and "UTF-8", is the
> new standard and can be assumed to be everywhere.
>
> The fact is, all those developers are wrong. Not because an array of
> bytes cannot contain what they believe, but because one can only
> guess which character set (encoding) an array of bytes maps to.
>
> Data itself carries no intrinsic meaning
> ========================================
>
> Pure data doesn't have any meaning. Its meaning depends on the
> interpretation given to it, that is, on some kind of context around
> it.
>
> When dealing with text, the meaning of data is usually determined by a
> character set, a mapping table or some more advanced encoding and
> decoding mechanism.
>
> For instance, the following sequence of numbers, expressed in
> decimal format and separated by spaces::
>
>    65 66 67 68 69
>
> will only mean the first letters of the western alphabet, ``ABCDE``,
> **if** we determine that its meaning is based on the ASCII character
> set (besides other details such as ordering, separator used, etc).
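To make that point concrete, here is a quick check (my sketch, not part of the original text): the same five byte values spell ``ABCDE`` only when an ASCII-compatible mapping is applied. The ``cp500`` codec (an EBCDIC code page) is used here just as a convenient counterexample.

```python
# The raw values 65..69 are just numbers; "ABCDE" is an interpretation.
values = [65, 66, 67, 68, 69]
data = bytes(bytearray(values))  # bytearray() keeps this Python 2/3 compatible

# Under ASCII (or any ASCII-compatible mapping) these bytes mean "ABCDE"...
assert data.decode('ascii') == u'ABCDE'

# ...but under an EBCDIC code page, the very same bytes decode to
# something entirely different.
assert data.decode('cp500') != u'ABCDE'
```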
>
> Turning arrays of bytes into text
> =================================
>
> On many occasions, usually when data is destined for humans, it is
> necessary to present it, and to deal with it, in a different way.
> Here, we use the abstract term *text* to refer to data that is more
> meaningful to humans, and that would usually be found in documents
> (such as this one) intended to be distributed and read by us, the
> non-machine beings.
>
> Reusing the example given earlier, one can do on a Python interpreter::
>
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute.decode('utf-8'))
>    1
>
> The process of turning bytes into "text" is called "decoding" by
> Python. It helps to think of bytes as something that humans cannot
> understand and that consequently needs deciphering (or decoding) to
> become something readable by humans.
>
> In this process, the encoding is of the utmost importance. It's
> analogous to a symmetric key used in a cryptographic operation. For
> instance, let's look at what happens when using the same data with a
> different encoding::
>
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute.decode('utf-16'))
>    1
>    >>> print(aacute.decode('utf-16'))
>    ꇃ
>
> Or giving too little data for a given encoding::
>
>    >>> aacute.decode('utf-32')
>    Traceback (most recent call last):
>      File "<stdin>", line 1, in <module>
>      File "/usr/lib64/python2.7/encodings/utf_32.py", line 11, in decode
>        return codecs.utf_32_decode(input, errors, True)
>    UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-1:
>    truncated data
>
> Even though Unicode is increasingly popular, it's also a good idea to
> remind ourselves that other, non-Unicode encodings exist. For
> instance, look at the same data when decoded using a character set
> developed for the Thai language::
>
>    >>> len(aacute.decode('tis-620'))
>    2
>    >>> print(aacute.decode('tis-620'))
>    รก
>
> Now, think about this: if you expect quick, consistent and reliable
> cryptographic operations, would you save a key for later use? Or
> would you just guess it whenever you need it?
>
> Hopefully, you've answered that you would save the key. The same
> applies to encoding: you should keep track of what you're using.
>
> What Python offers
> ==================
>
> There are a number of features that Python offers related to the
> encoding used. Some of them have differences depending on the
> Python version. When that's the case, the version used is made
> clear.
>
> Let's review those features now.
>
> sys.getfilesystemencoding()
> ---------------------------
>
> From the documentation, this function will *"Return the name of the
> encoding used to convert Unicode filenames into system file names, or
> None if the system default encoding is used."*
>
> To demo how this works, let's create a base directory with ASCII-only
> characters (and using the bytes type to avoid any implicit encoding)::
>
>    >>> import os
>    >>> os.mkdir(b'/tmp/mydir')
>
> And then, let's explicitly create a directory, again using a sequence
> of bytes::
>
>    >>> os.mkdir(b'/tmp/mydir/\xc3\xa1')
>
> If you look at the contents of the ``/tmp/mydir`` directory, you
> should find a single entry::
>
>    >>> os.listdir(b'/tmp/mydir')
>    ['\xc3\xa1']
>
> Which is just what we expected. Now, we'll start Python (2.7) with an
> environment variable that will influence the encoding it'll use for
> the conversion of Unicode filenames::
>
>    $ LANG=en_US.ANSI_X3.4-1968 python2.7
>    >>> import sys
>    >>> sys.getfilesystemencoding()
>    'ANSI_X3.4-1968'
>
> Now, let's ask Python to list all files (by using the standard library
> module ``glob``) in that directory::
>
>    $ LANG=en_US.ANSI_X3.4-1968 python2.7 -c "import glob; print(glob.glob(u'/tmp/mydir/\u00e1*'))"
>    []
>
> The list is empty because ``glob`` fails to match the reference once
> it is converted using that encoding. Basically, think of what would
> happen if you were to do::
>
>    >>> u'/tmp/mydir/\u00e1*'.encode(sys.getfilesystemencoding())
>
> On the other hand, by using an appropriate encoding::
>
>    $ LANG=en_US.UTF-8 python2.7 -c "import glob; print(glob.glob(u'/tmp/mydir/\u00e1*'))"
>    [u'/tmp/mydir/\xe1']
>
> The point here is that ``sys.getfilesystemencoding()`` will be used by
> some Python libraries when working with filenames.
>
> .. warning:: Don't expect any code to be perfect. For instance, the
>              author could find some issues with the ``glob`` module
>              used in the example above.
>
> sys.std{in,out,err}.encoding
> ----------------------------
>
> An ``encoding`` attribute may be set on ``sys.stdin``, ``sys.stdout``
> and ``sys.stderr`` to let applications know how to input and output
> meaningful text.
>
> Suppose you need to read text from the standard input and save it to
> a file in a specific encoding. The following script is going to be
> used as an example (``read_encode.py``)::
>
>    import sys
>
>
>    # On Python 3, "str" is unicode
>    if sys.version_info[0] >= 3:
>        unicode = str
>
>    sys.stdout.write("Enter text:\n")
>
>    input_read = sys.stdin.readline().strip()
>    if isinstance(input_read, unicode):
>        bytes_read = input_read.encode(sys.stdin.encoding)
>    else:
>        bytes_read = input_read
>
>    with open('/tmp/data.bin', 'wb') as data_file:
>        data_file.write(bytes_read)
>
> Now, on both Python 2 and 3 this produces the same results::
>
>    $ python2 -c 'import sys; print(sys.stdin.encoding)'
>    UTF-8
>    $ python2 read_encode.py
>    Enter text:
>    áéíóú
>    $ file /tmp/data.bin
>    /tmp/data.bin: UTF-8 Unicode text, with no line terminators
>
>    $ python3 -c 'import sys; print(sys.stdin.encoding)'
>    UTF-8
>    $ python3 read_encode.py
>    Enter text:
>    áéíóú
>    $ file /tmp/data.bin
>    /tmp/data.bin: UTF-8 Unicode text, with no line terminators
>
> The encoding set on ``sys.stdin.encoding`` was important to the
> example script, as the script needs to turn unicode into bytes.
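The branching in ``read_encode.py`` can be captured in a tiny helper. This is only an illustrative sketch (the name ``to_bytes`` is mine, not a stdlib or Avocado API), showing the "accept either type, produce bytes in a known encoding" pattern the script uses:

```python
import sys

# Same aliasing trick as in read_encode.py: on Python 3, "str" is unicode.
if sys.version_info[0] >= 3:
    unicode = str


def to_bytes(value, encoding='utf-8'):
    # Encode unicode input; let bytes pass through untouched.
    if isinstance(value, unicode):
        return value.encode(encoding)
    return value


assert to_bytes(u'\u00e1') == b'\xc3\xa1'    # unicode in, bytes out
assert to_bytes(b'\xc3\xa1') == b'\xc3\xa1'  # bytes pass through untouched
```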
>
> Now, suppose that your application, while reading input that matches
> the user's environment, must produce a file in the ``UTF-32``
> encoding. The code to do that could look similar to the following
> example (``write_utf32.py``)::
>
>    import sys
>
>
>    sys.stdout.write("Enter text:\n")
>
>    input_read = sys.stdin.readline().strip()
>    if isinstance(input_read, bytes):
>        unicode_str = input_read.decode(sys.stdin.encoding)
>    else:
>        unicode_str = input_read
>
>    with open('/tmp/data.bin.utf32', 'wb') as data_file:
>        data_file.write(unicode_str.encode('UTF-32'))
>
> Again, let's see how this performs under Python 2 and 3::
>
>    $ python2 -c 'import sys; print(sys.stdin.encoding)'
>    UTF-8
>    $ python2 write_utf32.py
>    Enter text:
>    áéíóú
>    $ file /tmp/data.bin.utf32
>    /tmp/data.bin.utf32: Unicode text, UTF-32, little-endian
>
>    $ python3 -c 'import sys; print(sys.stdin.encoding)'
>    UTF-8
>    $ python3 write_utf32.py
>    Enter text:
>    áéíóú
>    $ file /tmp/data.bin.utf32
>    /tmp/data.bin.utf32: Unicode text, UTF-32, little-endian
>
> .. tip:: Do not assume that ``sys.std{in,out,err}`` will always have
>          the ``encoding`` attribute, or that it will be set to a
>          valid encoding. For instance, when ``sys.stdin`` is not a
>          TTY, its ``encoding`` attribute will have a ``None`` value.
>
> A few points can be observed here:
>
> 1) Using Unicode strings internally, as an intermediate format, gives
>    you the freedom to read (decode) from different encodings, and at
>    the same time, to write (encode) into any other encoding.
>
> 2) Code that is expected to work under both Python 2 and 3 needs
>    some extra handling with regard to the data types being handled.
>
> 3) While determining the data type, one can either check for ``bytes``
>    or for ``unicode``. While it's certainly a matter of preference
>    and style, keep in mind that the ``bytes`` name exists on both
>    Python 2 and 3, while ``unicode`` exists only on Python 2.
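The tip above can be turned into a small defensive helper. Again, a sketch of mine (``stream_encoding`` is not an existing API): prefer the stream's own ``encoding`` attribute, and fall back to the locale module's guess when it is missing or ``None``:

```python
import locale


def stream_encoding(stream):
    # Prefer the stream's declared encoding; fall back to the locale guess.
    encoding = getattr(stream, 'encoding', None)
    return encoding or locale.getpreferredencoding()


class FakePipe(object):
    encoding = None  # what sys.stdin looks like when it is not a TTY


# Streams with a missing or None encoding get the locale's preferred one.
assert stream_encoding(FakePipe()) == locale.getpreferredencoding()
assert stream_encoding(object()) == locale.getpreferredencoding()
```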
>
> locale
> ------
>
> This Python standard library module is a wrapper around POSIX
> locale-related functionality.
>
> Because this discussion is about text encodings, let's focus on the
> ``locale.getpreferredencoding()`` function. According to the
> documentation, it *"Return(s) the encoding used for text data,
> according to user preferences"*. As it wraps the specificities of
> different platforms, it may not be able to determine the preferred
> encoding on some systems, and because of that, the documentation
> notes that *"this function only returns a guess"*.
>
> Even though it may be a guess, it is probably the best bet you can
> make.
>
> .. tip:: Many non-Linux/UNIX platforms implement some level of POSIX
>          functionality, and that happens to be the case for the
>          ``locale`` features discussed here. Because of that, the
>          Python ``locale`` module can also be found on platforms such
>          as Microsoft Windows.
>
> Error Handling
> --------------
>
> Some encoding and decoding operations just won't be possible. The
> most straightforward example is when you're trying to map a value
> that is outside the bounds of the mapping table.
>
> For instance, the ASCII character set defines mappings for values in
> the range of 0 up to 127 (7F in hexadecimal). That means that a value
> larger than 127 will cause an error.
>
> When dealing with Unicode strings in Python, those errors are
> represented as a ``UnicodeError`` (whose most common subclasses
> are ``UnicodeEncodeError`` and ``UnicodeDecodeError``).
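The relationship between these exception classes, and the effect of a codec error handler, can be checked directly (a quick sketch of mine, nothing Avocado-specific):

```python
# Encode and decode failures share a common ancestor, so a single
# "except UnicodeError" clause can catch both kinds.
assert issubclass(UnicodeDecodeError, UnicodeError)
assert issubclass(UnicodeEncodeError, UnicodeError)
assert issubclass(UnicodeError, ValueError)

try:
    b'\x80'.decode('ascii')
except UnicodeError as err:
    assert isinstance(err, UnicodeDecodeError)

# Error handlers avoid the exception altogether: "replace" substitutes
# U+FFFD (REPLACEMENT CHARACTER) for each undecodable byte.
assert b'ab\x80'.decode('ascii', 'replace') == u'ab\ufffd'
```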
>
> Getting back to the simple example given before, trying to decode
> the value 126 (7E when represented in hexadecimal) using the ASCII
> character set should work fine::
>
>    >>> b'\x7e'.decode('ascii')
>    u'~'
>
> But anything larger than 127 (7F) won't work::
>
>    >>> b'\x80'.decode('ascii')
>    Traceback (most recent call last):
>      File "<stdin>", line 1, in <module>
>    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position
>    0: ordinal not in range(128)
>
> The reason for the failure is explicit in the error message: 0x80
> (given in hex) is decimal 128, which is indeed not in ``range(128)``
> (which is zero based, and thus contains 0-127).
>
> There may be situations in which a different kind of error handling
> may be beneficial. Instead of catching ``UnicodeError`` exceptions
> and handling them on an individual basis, it's possible to use a
> registered error handler. Let's use a builtin error handler as
> an example.
>
> Suppose that your application reads from a file that is known to be
> encoded in ``UTF-8``, and you need to output to the system's
> preferred encoding (as defined by ``locale.getpreferredencoding()``).
> To make for a more realistic example, let's imagine that the
> application is a test runner like Avocado itself, reading from a file
> containing definitions of test variations and parameters, and writing
> out the test variation IDs that were executed. The test
> variations/parameters file will look like this (again, encoded in
> ``UTF-8``)::
>
>    intel-überdisk-workstation-20-12b3:cpu=intel;disk=überdisk;
>    intel-virtio-workstation-20-b322:cpu=intel;disk=virtio
>    amd-überdisk-workstation-20-c523:cpu=amd;disk=überdisk
>    amd-virtio-workstation-20-ddf3:cpu=amd;disk=virtio
>
> And the code to parse and report the tests could look like this::
>
>    import io
>    import locale
>
>
>    INTERNAL_ENCODING = 'UTF-8'
>
>    with io.open('parameters', 'r',
>                 encoding=INTERNAL_ENCODING) as parameters_file:
>        parameters_lines = parameters_file.readlines()
>
>    test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
>    with io.open('report.txt', 'w',
>                 encoding=locale.getpreferredencoding()) as output_file:
>        output_file.write(u"\n".join(test_variants_run))
>
> Now, on a given system, this runs as expected::
>
>    $ python -c 'import locale; print(locale.getpreferredencoding())'
>    UTF-8
>    $ python read_parameters_write_report.py && cat report.txt
>    intel-überdisk-workstation-20-12b3
>    intel-virtio-workstation-20-b322
>    amd-überdisk-workstation-20-c523
>    amd-virtio-workstation-20-ddf3
>    $ file report.txt
>    report.txt: UTF-8 Unicode text
>
> But on a **different** system::
>
>    $ python -c 'import locale; print(locale.getpreferredencoding())'
>    ANSI_X3.4-1968
>    $ python read_parameters_write_report.py && cat report.txt
>    Traceback (most recent call last):
>      File "read_parameters_write_report.py", line 12, in <module>
>        output_file.write(u"\n".join(test_variants_run))
>    UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in
>    position 6: ordinal not in range(128)
>
> One possible solution is using an error handler, such as ``replace``,
> by adding the ``errors`` parameter to the ``io.open`` call::
>
>    --- read_parameters_write_report.py      2018-04-17 18:33:26.781059079 -0400
>    +++ read_parameters_write_report.py.new  2018-04-17 18:33:58.677181944 -0400
>    @@ -9,5 +9,6 @@
>
>     test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
>     with io.open('report.txt', 'w',
>    -             encoding=locale.getpreferredencoding()) as output_file:
>    +             encoding=locale.getpreferredencoding(),
>    +             errors='replace') as output_file:
>         output_file.write(u"\n".join(test_variants_run))
>
> The result becomes::
>
>    intel-?berdisk-workstation-20-12b3
>    intel-virtio-workstation-20-b322
>    amd-?berdisk-workstation-20-c523
>    amd-virtio-workstation-20-ddf3
>
> Which may be better than crashing, but may also be unacceptable
> because information is lost. One alternative is to escape the data.
> Using the ``backslashreplace`` error handler, ``report.txt`` would
> look like::
>
>    intel-\xfcberdisk-workstation-20-12b3
>    intel-virtio-workstation-20-b322
>    amd-\xfcberdisk-workstation-20-c523
>    amd-virtio-workstation-20-ddf3
>
> This way, no information is lost, and the generated report respects
> the system's preferred encoding::
>
>    $ file report.txt
>    report.txt: ASCII text
>
> Guidelines
> ==========
>
> This section sets the general guidelines for byte/text data in
> Avocado, and consequently for the encoding used. It should be
> followed by Avocado plugins developed externally, so that consistent
> combined behavior is achieved.
>
> It can also be used as a guideline for test writers targeting
> Avocado on both Python 2 and 3.
>
> 1) When generating text that will be consumed by humans, Avocado SHOULD
>    respect the preferred system encoding. When that is not available,
>    Avocado's default encoding (currently ``UTF-8``, as defined in
>    ``avocado/core/defaults.py``) should be used.
>
> 2) When operating on data that may or may not contain text, Avocado
>    SHOULD treat the data as binary. If the owner of the data knows it
>    contains text destined for humans, what we call text, then the
>    data owner should handle the decoding. It's OK for utility APIs to
>    have helper functionality. One example is the
>    ``avocado.utils.process.CmdResult`` class, which contains both
>    ``stdout`` and the ``stdout_text`` attribute/property. Even then,
>    the user producing the data is responsible for determining the
>    encoding used when treating the data as text.
>
> 3) When operating on data that provides its encoding as metadata
>    (either through an alternative channel, or in a way that can
>    reliably be obtained from the data itself), Avocado MUST respect
>    that encoding. One example is respecting the encoding that can be
>    given in the ``Content-Type`` headers of an HTTP session.
>
> 4) Avocado functionality CAN restrict the encodings it generates if an
>    expressive enough character set is used and the generated data
>    contains metadata that clearly defines the encoding used. One
>    example is the HTML plugin, which is currently limited to producing
>    content in ``UTF-8``.
>
> 5) All input given by humans to the Avocado test runner, such as test
>    references, parameters coming from files and other loader
>    implementations, command line parameter values and others, should
>    be treated as text unless noted otherwise. This means that Avocado
>    should be able to deal with test references given in the
>    system's preferred encoding transparently.
>
> 6) Avocado code should, when operating on text data, use unicode
>    strings internally (``unicode`` on Python 2, and ``str`` on Python
>    3).
>
> Besides those points, it's worth noting that a number of utility
> functions related to binary and text data, and to encoding handling,
> are growing organically, and can be seen in modules such as
> ``avocado.utils.astring``. Further functionality is currently being
> proposed upstream and may soon be part of the Avocado libraries.
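In practice, guideline 6 often boils down to a boundary helper like the following sketch. The name ``to_text`` and its default encoding are illustrative only; see ``avocado.utils.astring`` for the utilities the project actually ships:

```python
import sys

# On Python 3, the unicode type is spelled "str".
if sys.version_info[0] >= 3:
    unicode = str


def to_text(data, encoding='utf-8'):
    # Decode bytes into unicode at the boundary; keep unicode as-is.
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


assert to_text(b'\xc3\xa1') == u'\u00e1'   # bytes are decoded
assert to_text(u'\u00e1') == u'\u00e1'     # unicode passes through
assert isinstance(to_text(b'abc'), unicode)
```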
>
> Caveats
> =======
>
> While handling text and binary types in Avocado, please pay attention
> to the following caveats:
>
> 1) The Avocado test runner replaces the stock
>    ``sys.std{in,out,err}.encoding``, so if you're writing a plugin, do
>    not assume/expect these to contain an encoding setting.
>
> 2) Some features on core Avocado, as well as on external plugins,
>    still fall short of the guidelines described here. This is a
>    work in progress. Please exercise care, and investigate the code
>    status, when relying on them.

Cheers!
- Cleber.

> ---
>
> [1] - https://docs.python.org/3/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
> [2] - https://docs.python.org/2.7/c-api/object.html#c.PyObject_Bytes
> [3] - http://unicode.scarfboy.com/?s=u%2B00e1
>
> --
> Cleber Rosa
> [ Sr Software Engineer - Virtualization Team - Red Hat ]
> [ Avocado Test Framework - avocado-framework.github.io ]
> [ 7ABB 96EB 8B46 B94D 5E0F E9BB 657E 8D33 A5F2 09F3 ]