On 13 January 2014 09:55, Guido van Rossum <gu...@python.org> wrote:
> There's a lot of discussion about PEP 460 and I haven't read it all.
> Maybe you all have already reached the same conclusion that I have. In
> that case I apologize (but the PEP should be updated). Here's my
> contribution:
>
> PEP 460 itself currently rejects support for %d, AFAIK on the basis
> that bytes aren't necessarily ASCII. I think that's a misunderstanding
> of the intention of the bytes type.
>
> The key reason for introducing a separate bytes type in Python 3 is to
> avoid *mixing* bytes and text. This aims to avoid the classic Python 2
> Unicode failure, where str+unicode fails or succeeds based on whether
> str contains non-ASCII characters or not, which means it is easy to
> miss in testing. Properly written code in Python 3 will fail based on
> the *type* of the objects, not based on their contents. Content-based
> failures are still possible, but they occur in typical "boundary"
> operations such as encode/decode.
>
> But this does not mean the bytes type isn't allowed to have a
> noticeable bias in favor of encodings that are ASCII supersets, even
> if not all bytes objects contain such data (e.g. image data,
> compressed data, binary network packets, and so on).

I am a strong -1 on the more lenient proposal, as it makes binary
interpolation in Python 3 an *unsafe operation* for ASCII incompatible
binary formats.

The existing binary operations that assume ASCII do so *inherently* -
they're not input driven, the operation itself assumes ASCII, so if
you're working with data that may not be ASCII compatible, you simply
don't use them (these are operations like title(), upper(), lower(),
the default arguments for split() and strip(), etc). They don't accept
text or other structured data as input - you have to provide existing
binary data or individual byte values (or, in the case of split(),
strip(), the special value None to indicate the assumption of ASCII
whitespace).
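
A minimal sketch of that behaviour, using arbitrary sample data chosen purely for illustration. The operations below assume ASCII no matter what is passed in, so they give wrong answers on ASCII incompatible data unless explicit binary arguments are supplied:

```python
# upper() and the default split() assume ASCII by definition.
ascii_data = b"content-type: text/plain"
upper_ascii = ascii_data.upper()        # b'CONTENT-TYPE: TEXT/PLAIN'
split_ascii = ascii_data.split()        # [b'content-type:', b'text/plain']

# Bytes outside the ASCII range are passed through without case mapping.
latin1_data = "\u00e9".encode("latin-1")    # b'\xe9' (e with acute accent)
upper_latin1 = latin1_data.upper()          # unchanged: b'\xe9'

# On an ASCII incompatible encoding the default split() is simply wrong:
# it cuts the UTF-16 code unit for the space character in half.
utf16_data = "a b".encode("utf-16-le")      # b'a\x00 \x00b\x00'
split_default = utf16_data.split()          # [b'a\x00', b'\x00b\x00']

# Supplying explicit binary data keeps the operation in the binary domain.
split_explicit = utf16_data.split(" ".encode("utf-16-le"))  # [b'a\x00', b'b\x00']
```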

With PEP 460 as it stands, binary interpolation is safe - you can't
implicitly introduce an ASCII assumption, regardless of the format
string or input data, as everything that hasn't already been
translated to the binary domain will be rejected with a TypeError. By
allowing format characters that *do* assume ASCII, the entire
construct is rendered unsafe - you have to look inside the format
string to determine if it is assuming ASCII compatibility or not, thus
the entire construct must be deemed as assuming ASCII compatibility at
the level of static semantic analysis.
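
The same type driven rejection is already visible in existing binary operations. The sketch below uses bytes.join() as a stand-in, since it is not the PEP 460 formatting API itself; the helper name try_join is hypothetical and exists only to capture the outcome:

```python
def try_join(parts):
    """Join parts in the binary domain, or report the exception type raised."""
    try:
        return b"".join(parts)
    except TypeError as exc:
        return type(exc).__name__

# Data already in the binary domain is accepted.
joined = try_join([b"GET ", b"/index.html"])

# Text and integers are rejected based on their type, not their contents.
text_result = try_join([b"GET ", "/index.html"])
int_result = try_join([b"Content-Length: ", 42])
```

Nothing about the contents of the rejected values matters here: the failure depends only on their types, so it shows up the first time the code path is exercised.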

The more lenient proposal also creates an ambiguity about what it
means to pass an integer to a binary formatting operation - is it
about inserting individual byte values in the range 0-255, or is it
about inserting the ASCII encoded digits of arbitrary integers, or
does it depend on which formatting code you use? PEP 460 is currently
entirely consistent with the other binary operations (it only accepts
integers in the 0-255 range and interprets them as byte values), while
the more lenient approach goes for the "it depends on the formatting
code" alternative.
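
The two readings of an integer can be sketched with existing operations only (no formatting API is assumed here; the value 65 is an arbitrary example):

```python
value = 65

# Reading 1: the integer is a raw byte value in the range 0-255,
# which is how the existing bytes and bytearray operations treat it.
as_byte_value = bytes([value])      # b'A'
buf = bytearray(b"xyz")
buf.append(value)                   # buf is now bytearray(b'xyzA')
first_item = b"ABC"[0]              # indexing returns the integer 65
is_member = value in b"ABC"         # membership tests accept byte values

# Reading 2: the integer is rendered as its ASCII encoded decimal digits.
as_ascii_digits = str(value).encode("ascii")    # b'65'

# Under reading 1, values outside 0-255 are rejected.
out_of_range_rejected = False
try:
    bytes([256])
except ValueError:
    out_of_range_rejected = True
```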

Allowing these ASCII assuming format codes in the core bytes
interpolation introduces *exactly* the same problem as is present in
the Python 2 text model: code that *appears* to support arbitrary
binary data, but is in fact assuming ASCII compatibility. So any code
that has to handle ASCII incompatible encodings will need to be
implemented with the warning "don't use any of the binary formatting
operations for data that may not be ASCII compatible, but we also
don't provide a convenient equivalent that can be guaranteed to be
safe so we know you're going to ignore this warning and do it anyway".
That kind of "don't do that, it may cause problems with certain
inputs" is *exactly* the kind of bug magnet that the Python 3
transition was designed to categorically eliminate.

PEP 460 is perfect in that regard - it provides exactly as much
functionality as can be done correctly when manipulating arbitrary
binary data, and no more. It has no trace of the legacy Python 2 text
model.

However, I *also* accept that the Python 2 text model is convenient
for certain use cases. This is why, in addition to PEP 460 as it
currently stands, I am also (with Benno Rice) one of the instigators
of the asciicompat project, and have promised Benno that I will ensure
that any interoperability bugs asciicompat.asciistr uncovers in the
core types are fixed (for Python 3.3+, since it depends on the PEP 393
internal representation for strings). asciistr will provide a public
API that behaves *exactly* like a text type (including interoperating
with strings and returning length 1 substrings when indexing,
interpreting integers and other numeric types as their ASCII
representation when passed in, supporting *full* text formatting
semantics), but also exists in the binary domain, by exporting the
bytes view of its internal data through the PEP 3118 buffer API.

In this way, asciistr will be a *new* general purpose mechanism for
translating between the binary and text domains in Python 3, just like
str.encode, bytes/bytearray.decode and the struct module.
It doesn't need to compromise - its objectives are to make working
with ASCII compatible binary protocols and writing hybrid binary/text
APIs exactly as convenient as it was in Python 2, because that's where
the test suite is developed: in Python 2, using "asciistr=str". It
just doesn't need to be a builtin and, at this point in time, doesn't
even need to be in the standard library. It can be developed on GitHub
and published on PyPI and made available for Python 3.3 and above
(it's also trivially 2.x compatible: there, it just republishes the
str builtin as asciicompat.asciistr).
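
For comparison, the existing translation mechanisms named above look like this in practice. This is only a sketch with arbitrary sample values; it does not show asciistr itself:

```python
import struct

# Text -> binary and back, via an explicitly named encoding.
encoded = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9'
decoded = encoded.decode("utf-8")

# Structured data -> binary and back, via the struct module
# (">H" is a big-endian unsigned 16-bit integer).
packed = struct.pack(">H", 258)             # b'\x01\x02'
(unpacked,) = struct.unpack(">H", packed)

# Content based failures happen here, at the boundary, rather than
# somewhere deep inside code that mixes the two domains.
try:
    "caf\u00e9".encode("ascii")
    boundary_failure = None
except UnicodeEncodeError as exc:
    boundary_failure = type(exc).__name__
```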

Once asciistr is working, we can also look into creating
"asciicompat.asciiview", which would be a PEP 3118 *consumer* in
addition to a publisher, and provide asciistr functionality for
existing binary data, without needing to copy it.
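
The "without needing to copy it" part is what the PEP 3118 buffer API already offers to consumers. A minimal sketch using the builtin memoryview as the consumer (the request line is made-up sample data, and this is not an asciiview implementation):

```python
data = bytearray(b"GET /index.html HTTP/1.1")

# memoryview consumes the buffer exported by the bytearray; no copy is made.
view = memoryview(data)
method = view[:3]                   # slicing a memoryview does not copy either

before = method.tobytes()           # b'GET' (tobytes() itself makes a copy)

# Writing through the view changes the original object, which shows that
# the view and the bytearray share the same memory.
method[:] = b"PUT"
after = bytes(data)
```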

ASCII compatible protocols *are* special and *are* worthy of having a
dedicated type devoted to handling them. However, it shouldn't be at
the expense of compromising the ability of Python 3 users to ensure
that they aren't accidentally introducing assumptions of ASCII
compatibility where they don't belong, particularly when doing so
produces a clearly *inferior* solution. The superior solution looks
like this:

* bytes/bytearray/memoryview: pure binary types, operate entirely in
the binary domain. They provide convenience operations that are only
valid for ASCII compatible data, but the ASCII assumption is inherent
in the operation itself rather than being input driven (the one minor
exception being that passing None to split() and strip() operations
assumes ASCII whitespace).
* asciicompat.asciistr: hybrid type that exposes a text API in the
application domain, but also exposes binary data directly for binary
interoperability
* str: pure text type, operates entirely in the application domain.

This approach also opens up the possibility of eventually leveraging
PEP 393 to provide an asciicompat.utf8str type which allows arbitrary
unicode characters and exports the UTF-8 representation, rather than
restricting the permitted code points to 7-bit ASCII, as well as an
asciicompat.latin1str which permits arbitrary 8 bit data (representing
it as latin-1 text in the application domain), or even an
asciicompat.encodedstr that supports any 8-bit encoding.
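
The properties those hypothetical types would rely on can be checked with the existing codecs. A sketch, with arbitrary sample values:

```python
all_bytes = bytes(range(256))

# latin-1 maps each byte value 0-255 to the code point with the same number,
# so arbitrary 8-bit data survives a decode/encode round trip.
as_text = all_bytes.decode("latin-1")
round_trip = as_text.encode("latin-1")

# ASCII only covers 7-bit values, so the same data cannot be decoded as ASCII.
try:
    all_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# UTF-8 is an ASCII superset: pure ASCII text encodes to identical bytes,
# while other code points become multi-byte sequences.
header = "Host: example.org"
same_as_ascii = header.encode("utf-8") == header.encode("ascii")
euro_sign = "\u20ac".encode("utf-8")    # b'\xe2\x82\xac'
```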

The key thing that the text model change in Python 3 enabled is for us
to use the type system to *help* with managing the complexity of
dealing with text encodings. We've come a long way with just the two
pure types, and no additional types that straddle the binary/text
boundary the way the Python 2 str type did. Unlike introducing *new*
ASCII-only operations to the bytes type, adding new types specifically
for dealing with ASCII compatible formats (especially starting life as
a third party library) isn't compromising the Python 3 text model,
it's embracing it and making it work for us (which is why I've been
suggesting that it be considered since at least 2010). The problem
with "str" in Python 2 was that one type was used to represent too
many things with serious semantic differences.

The ongoing attempts to reintroduce that ambiguity to the core bytes
type rather than exploring the creation of new types and then filing
bugs for any interoperability issues those attempts uncover in the
core types represent one of the worst cases of paradigm lock that I
have ever seen :P

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia