Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

Stephen J. Turnbull Sun, 12 Jan 2014 19:03:46 -0800

Ethan Furman writes:

 > 1) Are you saying it's okay to be insulting when frustrated?  I
 >    also find this mega-thread frustrating, but I'm trying
 >    very hard not to be insulting.


OK, no.  Understandable, yes.

 > 2) If you are going to use my name, please be certain of the facts
 >    [1].  More below.
 > 
 > > MAL posted straight out the Python 2 model of text makes it easier for
 > > him to write some programs, so he's all for reintroducing it.  And
 > > that is the whole truth of the matter.  Although I disagree with him,
 > > I appreciate his honesty.
 > 
 > If you have an example of me lying (even if it's just a
 > possibility), please refer to it directly so I can either try to
 > explain the misunderstanding or apologize.

Praising one person for honesty doesn't imply anybody else is lying.

As for the Artist Currently Posting as Ethan Furman, he's not in the
"disingenuous" group.  I don't think you understand the issues at stake
(among other things, as I've discussed elsewhere, I think your use
case is different from the use cases of most of those who are asking
for bytes formatting).  And there's a crucial terminology difference:

 > In only one case did I use the word "text" loosely,

From my point of view, you consistently do so.  Bytes are *never*
Python 3 text in my terminology, and I think that is generally
accepted on these channels.  "ASCII-encoded text", as you repeatedly
call it, which you want to manipulate using str-like methods on
bytes, is *exactly* the Python 2 model of text.  But you deny that the
effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python
2's bytes/character confusion, don't you?

Yes, I've used "ASCII-compatible text" in some of my posts, but I
recognize that as "loose usage", too, and would stop if requested.
Note I'm not asking you to stop -- I think we all understand what you
mean, even though for some of us it's loose terminology.  What I do
hope you will recognize is that adding str-like methods to bytes is
precisely the Python 2 model of text processing[1], and that like MAL
you will say, "OK, I don't see a problem with reintroducing Python 2's
byte/character confusion."  (Well, I *really* want you to see the
light, and retract your proposal for b'%d' format.  But that hardly
seems likely. :-)

 > But don't lie to me (as Nick tried to) and say that "In particular,
 > the bytes type is, and always will be, designed for pure binary
 > manipulation" when it has methods like .center().

I hardly think Nick is *lying*, any more than you are.  AFAICT, you're
*both* wrong.  According to PEP 3137[2] by Guido van Rossum, the idea
of the immutable bytes type was suggested (in various aspects which
combined to overcome Guido's initial opposition) by Gregory P. Smith,
Jeffrey Yasskin, and Talin.  Guido then chose to implement it by
grabbing the Python 2 code, and removing .encode, and removing
locale-dependent definitions of character classes.  This was with a
view to supporting ports of code that implements wire protocols or
uses bytes as encoded text:

    It also makes it possible to efficiently create hash tables using
    bytes for keys; this may be useful when parsing protocols like
    HTTP or SMTP which are based on bytes representing text.

    Porting code that manipulates binary data (or encoded text) in
    Python 2.x will be easier using the new design than using the
    original 3.0 design with mutable bytes; simply replace str with
    bytes and change '...' literals into b'...' literals.

IIRC, only later was regex support added to bytes (by Nick himself,
again IIRC).  And despite the quote above, I don't think Guido meant
to encourage use of bytes as text in wire protocol development, at
least not at that time.  

Note that Nick has already admitted that permitting even methods that
can be implemented purely as numerical manipulations:

    def is_uppercase(b):
        # Note all comparisons are between integers:
        return ord('A') <= b[0] and b[0] <= ord('Z')

was in retrospect a mistake (in his opinion).  So I don't think it was
a lie, merely a difference in your definitions of "pure binary
manipulation".  (Which isn't surprising, given that ultimately
everything in computers as we know them today eventually reduces to
"pure binary manipulations".[3]  Drawing the line is going to involve
personal taste to some extent.)  I think his interpretation that bytes
were *designed* that way is a bit strained given PEP 3137.  I also
don't know what was discussed at language summits, and don't recall
the python-dev conversations about it at all.
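
As a minimal illustration of why "pure binary manipulation" is hard to
pin down (this sketch is not from the PEP or from any post in the
thread), Python 3 bytes yield integers when indexed, yet they keep
several str-like methods that assume the content is ASCII:

```python
data = b"abc"

# Indexing a bytes object yields an int, not a length-1 bytes object.
first = data[0]

# Str-like methods survive on bytes and treat the content as ASCII.
centered = data.center(7)
uppered = data.upper()

# Bytes outside the ASCII range are left untouched by the case methods.
mixed = b"caf\xe9".upper()

print(first, centered, uppered, mixed)
```

Whether .center() and .upper() count as binary manipulation or as text
processing is exactly the definitional question at issue here.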

A final remark: Be very careful in interpreting Guido's words in these
"practical vs. pure" matters.  I've discovered his offhand comments on
these matters are often both subtle and deep (that probably doesn't
surprise you), and that the idea behind them is usually extremely
precise though his expression may be informal or even casual (and here
be dragons -- taking the expression too literally may lead you astray).

 > I think some of the misunderstanding (which you also seem to suffer
 > from) is that we (or at least I) /ever/ want a unicode string back
 > from bytes interpolation.  I don't!

Please tell me why you think I suffer from that misunderstanding.  I
certainly don't think you *want* Unicode strings.  You've been quite
strident about the fact that you don' need no steekin' yooneekode (for
these purposes).

What I want to find out is why your use case can't be handled with
Python 3 str.  That's why I provide examples (mostly parallel to
yours) that return str in Python 3 (I can't speak for anyone else).

 > To summarize, I used the term text when referring to unicode text
 > (str), ASCII or ASCII-encoded text to refer to bytes that are to be
 > used in a place that requires ASCII bytes for communication (such
 > as content length or field type).

I've never been confused about that, but your use of the word "text"
differently from others in the thread seems to confuse you about what
*they* mean.

But did you get that I'm worried that programmers in Omaha will use
that same functionality to communicate American English (for which it
is basically sufficient, and which also requires ASCII when bytes are
used for communication)?

 > *My* definition is not ambiguous at all.  If this particular part
 > of the byte stream is defined to contain ASCII-encoded text, then I
 > can use the bytes text methods to work with it.

But how is Python supposed to know that?  The point of having types in
a programming language is so that either the interpreter can just
DTRT, or raise an exception if TRT is ambiguous, without explicit
specification by the programmer.  This is precisely what asciistr is
for: it knows that it is both unicode and bytes compatible, and morphs
automatically to whichever it is combined with.  And does so
efficiently (because they're all immutable, any combination of these
types in Python involves copying "code units", and for asciistr that
copy is always of bytes, thus reducing eventually to memcpy for bytes
and latin1-only str).

But under your definition, you need to make the decision, or
explicitly code the decision, on the basis of context.

 > > When it's convenient for them to use text-processing operations
 > > on bytes, they'll say "oh, yes, these are conventionally
 > > considered text-processing features, but that's just an accident
 > > of the particular configuration of bytes -- yup, bytes -- I'm
 > > processing."
 > 
 > If that particular configuration of bytes is because it's
 > ASCII-encoded text, then sure.

Once again, you are advocating precisely the Python 2 model of text.

 > To use, for example, bytes.__upper__ on data that wasn't
 > ASCII-encoded text (even if it happened to look like it was) would
 > be the height of stupidity.  Please don't include me in such
 > accusations.

I have no idea why you think I think anybody would be that stupid.
That never occurred to me.  It's precisely "magic numbers" that happen
to look like English words when interpreted as ASCII coded characters
that I don't want manipulated by str-like methods that interpret text
(such as full-featured format or %).

If b"Content-Length: 123" is (ASCII-encoded) text, then it should be
created as, or decoded to, internal text and handled that way.  If
it's binary, then handle it as binary.
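
A minimal sketch of the "handle it as text" option, assuming the header
really is ASCII.  The function names build_content_length and
parse_content_length are hypothetical, chosen only for illustration:

```python
def build_content_length(body: bytes) -> bytes:
    # Format as internal text (str), then encode once at the boundary.
    header = "Content-Length: %d" % len(body)
    return header.encode("ascii")


def parse_content_length(line: bytes) -> int:
    # Decode once at the boundary, then use ordinary str operations.
    name, _, value = line.decode("ascii").partition(":")
    if name.strip().lower() != "content-length":
        raise ValueError("not a Content-Length header")
    return int(value.strip())


wire = build_content_length(b"x" * 123)
length = parse_content_length(wire)
print(wire, length)
```

All formatting happens on str; bytes appear only where data enters or
leaves the program.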

 > > ambiguous form".  IMO, with the proposed changes, that is likely to
 > > continue indefinitely, negating some of the gains I expected to
 > > receive from Python 3. :-(
 > 
 > This would be a good reason to reject PEP 460, if that danger was
 > deemed more likely than the good it would bring.

Depends on which version.  I earlier opposed PEP 460 in any form, but
I'm persuaded by Nick's particular definition of "pure binary
manipulation" and agree that PEP 460 as revised by Antoine is harmless
to my goals.  That said, I personally am unlikely to find any great
convenience in it (partly as a matter of style, and to a great extent
for lack of use cases, although I'd like to get involved in the email
module).

 > > Note: there are a lot of high-level frameworks like Django that even
 > > in Python 2 basically went to Unicode everywhere internally.  I don't
 > > deny that.  I think that Python 3 as currently constituted makes it a
 > > lot easier to make an appropriate decision of where to convert, and
 > > should take some of the burden off the high-level frameworks.
 > > Approving this PEP, especially in a maximalist form, will blur the
 > > lines.
 > 
 > I understand your point, but I disagree.  When I open a file (in
 > binary mode, obviously, as otherwise I'd get massive corruption)

Obviously, *you* would open the file in binary mode, but by definition
of the latin1 codec and the surrogateescape handler, *I* can
definitely avoid any corruption when reading such files as text.
(This may require painful contortions if one does any nontrivial
processing, but then again it may not.)
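
The round-trip guarantee can be checked directly.  This is a small
sketch using in-memory bytes rather than a real file; the sample
payload is made up for illustration:

```python
raw = bytes(range(256))  # every possible byte value

# latin-1 maps byte N to code point N, so decoding never fails and
# encoding restores the original bytes exactly.
as_latin1 = raw.decode("latin-1")
latin1_ok = as_latin1.encode("latin-1") == raw

# With surrogateescape, bytes the codec cannot decode become lone
# surrogates (U+DC80..U+DCFF) and are turned back into the same bytes
# when encoding with the same handler.
as_ascii = raw.decode("ascii", errors="surrogateescape")
escape_ok = as_ascii.encode("ascii", errors="surrogateescape") == raw

# Text operations on the decodable part leave the escaped byte intact.
payload = b"Subject: caf\xe9\r\n"  # \xe9 is not valid UTF-8 here
text = payload.decode("utf-8", errors="surrogateescape")
edited = text.replace("Subject", "Topic")
restored = edited.encode("utf-8", errors="surrogateescape")

print(latin1_ok, escape_ok, restored)
```

The same codec and errors arguments can be passed to open() when
reading a file in text mode.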

 > I get back a bunch of bytes.  When working with tcp, I get back a
 > bunch of bytes.  bytes are /already/ the boundary type.

No, they are not.  Clearly there are "just bytes" on the "outside" of
I/O in each of your examples here, and they are "just copied" to the
inside of Python.  But in Nick's sense, this is the "outside," *not*
the "inside", of your program!  On the "inside", *you* want "a bool,
an int, a float, a date, or, even, a str" (I'm quoting!).  What Nick
means by a "boundary type" is a type that works seamlessly with the
types on each side of the boundary as a helper in the conversion.  So
when you use a struct to pack a bool, an int, and a date into a bytes,
the struct is the boundary type.  And if there's a helper type to work
with bytes and/or str simultaneously, that's a boundary type, eg,
asciistr.  But bytes itself is not a boundary type, it's just a type
with no internal structure, not even characters.
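
A minimal sketch of struct acting as the helper at the boundary.  The
record layout and the names RECORD, pack_record and unpack_record are
assumptions made up for this example, not anything proposed in the
thread:

```python
import struct
from datetime import date

# Hypothetical layout: little-endian bool, signed 32-bit int, and an
# unsigned 32-bit proleptic Gregorian ordinal for the date.
RECORD = struct.Struct("<?iI")


def pack_record(flag: bool, count: int, day: date) -> bytes:
    # Inside the program: bool, int, date.  Outside: just bytes.
    return RECORD.pack(flag, count, day.toordinal())


def unpack_record(buf: bytes):
    flag, count, ordinal = RECORD.unpack(buf)
    return flag, count, date.fromordinal(ordinal)


wire = pack_record(True, 42, date(2014, 1, 12))
fields = unpack_record(wire)
print(len(wire), fields)
```

The bytes object carries no structure of its own; the knowledge of what
each byte means lives entirely in the Struct definition.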

 > If we have to make a third type for proper boundary processing it's
 > an admission that bytes failed in its role.

That admission was made in PEP 3100.

Or, more precisely, bytes was never considered as a boundary type in
Python 3.

Footnotes: 
[1]  To be precise, one of two models, the other one being the unicode
type.

[2]  http://www.python.org/dev/peps/pep-3137/

[3]  OK, OK, I still have my Daddy's K&E loglog slide rule.  Not
*everything* is binary!
