Let me answer you both since the issues are related.

On 1/14/2014 7:46 AM, Nick Coghlan wrote:

> Guido van Rossum writes:
>> And that is precisely my point. When you're using a format string,

Bytes interpolation uses a bytes format, or a byte string if you will, but it should not be thought of as a character or text string. Certain bytes (123 and 125) delimit a replacement field. In my version, the bytes in between define a format-spec, which is ascii-decoded to text for input to 3.x format(). The decoding, and the subsequent encoding of the result, would not be needed if 2.7's format(ob, byte-spec) were available.
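
For concreteness, a minimal sketch of that spec handling (the function name is mine, not code from my byteformat draft):

def format_field(value, spec_bytes):
    # Decode the bytes between 123 and 125 with ascii, hand the text
    # spec to 3.x format(), and ascii-encode the formatted text back.
    spec = spec_bytes.decode('ascii')
    return format(value, spec).encode('ascii')

print(format_field(3.14159, b'.3f'))   # b'3.142'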

>> all of the format string (not just the part between { and }) had
>> better use ASCII or an ASCII superset.

I am not even sure what you mean here. The bytes outside of the 123 ... 125 replacement fields are simply copied to the output bytes. There is no encoding or interpretation involved.

It is true that the uninterpreted bytes had best not contain a byte pattern mistakenly recognized as a replacement field. I plan to refine the regular expression byte pattern used in byteformat to sharply reduce the possibility of such errors. When such errors happen anyway, an exception should be raised, and I plan to expand the error message to give more diagnostic information.
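
As a sketch of the kind of check I have in mind (not the actual byteformat code):

import re

FIELD = re.compile(rb'\{[^{}]*\}')   # a replacement field: {spec-bytes}

def find_fields(fmt):
    # Any brace byte (123 or 125) left over after removing well-formed
    # fields is an error worth reporting with specifics.
    fields = FIELD.findall(fmt)
    leftover = FIELD.sub(b'', fmt)
    for meta in (123, 125):          # the bytes b'{' and b'}'
        if meta in leftover:
            raise ValueError('unmatched brace byte %d in %r' % (meta, fmt))
    return fields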

> And this (rightly) constrains the output to an ASCII superset as well.

What does this mean? I suspect I disagree. The bytes interpolated into the output bytes can be any bytes.

> Except that if you interpolate something like Shift JIS,

Bytes interpolation interpolates bytes, not encodings. A self-identifying byte stream starts with bytes in a known encoding that specify the encoding of the rest of the stream. Neither part need be encoded text. (Would that something like this were standard for encoded text streams, as well as for serialized images.)
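
A toy illustration of such a self-identifying stream (my example, nothing proposed in the thread):

def decode_stream(raw):
    # The header line is defined to be ascii and names the encoding of
    # everything after it; the body need not even be encoded text.
    header, _, body = raw.partition(b'\n')
    return body.decode(header.decode('ascii'))

raw = b'utf-8\n' + 'café'.encode('utf-8')
print(decode_stream(raw))   # café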

>> [snip]

> Right, that's the danger I was worried about, but the problem is that
> there's at least *some* minimum level of ASCII compatibility that
> needs to be assumed in order to define an interpolation format at all
> (this is the point I originally missed).

I would put this slightly differently. To process bytes, we may define certain bytes as metabytes with a special meaning. We may choose the bytes that happen to be the ascii encoding of certain characters. But once the special numbers are chosen, they are numbers, not characters.

The problem of metabytes having both a normal and special meaning is similar to the problem of metacharacters having both a normal and special meaning.

> For printf-style formatting,
> it's % along with the various formatting characters and other syntax
> (like digits, parentheses, variable names and "."), with the format
> method it's braces, brackets, colons, variable names, etc.

It is the bytes corresponding to these characters. This is true also of the metabytes in an re module bytes pattern.
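
For example (my illustration; byte 46 is the ascii code for '.', a regular expression metacharacter):

>>> import re
>>> re.findall(rb'.', bytes([1, 46, 200]))   # metabyte 46: matches any byte
[b'\x01', b'.', b'\xc8']
>>> re.findall(rb'\.', bytes([1, 46, 200]))  # escaped: matches only byte 46
[b'.']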

> The mini-language parser has to assume an encoding
> in order to interpret the format string,

This is where I disagree with you and Guido. Bytes processing is done with numbers 0 <= n <= 255, not characters. The fact that ascii characters can, for convenience, be used in bytes literals to indicate the corresponding ascii codes does not change this. A bytes parser looks for certain special numbers. Other numbers need not be given any interpretation and need not represent encoded characters.

> and that's *all* done assuming an ASCII compatible format string

Since any bytes can be regarded as an ascii-compatible latin-1 encoded string, that seems like a vacuous assumption. In any case, I do not see any particular assumption in the following, other than the choice of replacement field delimiters.

>>> list(byteformat(bytes([1, 2, 10, 123, 125, 200]),
...                 (bytes([50, 100, 150]),)))
[1, 2, 10, 50, 100, 150, 200]
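
For the curious, a minimal sketch that reproduces the session above (the draft byteformat is more elaborate, but the idea is the same):

import re

FIELD = re.compile(rb'\{([^{}]*)\}')

def byteformat(fmt, args):
    # Ordinary bytes are copied through uninterpreted; each field
    # consumes the next argument.  An empty spec inserts the argument's
    # bytes as-is; a non-empty spec is ascii-decoded, handed to 3.x
    # format(), and the formatted text ascii-encoded.
    out, pos, values = bytearray(), 0, iter(args)
    for m in FIELD.finditer(fmt):
        out += fmt[pos:m.start()]
        spec, value = m.group(1), next(values)
        if spec:
            out += format(value, spec.decode('ascii')).encode('ascii')
        else:
            out += value
        pos = m.end()
    out += fmt[pos:]
    return bytes(out)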

> (which must make life interesting if you try to use an
> ASCII incompatible coding cookie for your source code - I'm actually
> not sure what the full implications of that *are* for bytes literals
> in Python 3).

An interesting and important question. The Python 2 manual says that the coding cookie applies only to comments and string literals. To me, this suggests that any encoding can be used. I am not sure how and when the encoding is applied. It also suggests that the sequence of bytes resulting from a string literal is not determined solely by the sequence of characters comprising the literal, but also depends on the coding cookie.

The Python 3 manual says that the coding cookie applies to the whole source file. To me, this says that the subset of unicode chars included in the encoding *must* include the ascii characters. It also suggests to me that the encoding must be ascii-compatible, so that the ascii-text coding cookie itself can be read (unless there is a fallback to the system encoding).

In any case, a 3.x source file is decoded to unicode. When the sequence of unicode chars comprising a bytes literal is interpreted, the resulting sequence of bytes depends only on the literal and not the file encoding. So list(b'{}'), for instance, should always be [123, 125] in 3.x. My comments above about byte processing assume that this is so.
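
A quick check of that claim (a sketch; 3.x compile() accepts bytes source and honors the coding cookie):

for cookie in ('utf-8', 'latin-1'):
    src = ("# -*- coding: %s -*-\nresult = list(b'{}')\n"
           % cookie).encode(cookie)
    ns = {}
    exec(compile(src, '<test>', 'exec'), ns)
    print(cookie, ns['result'])    # [123, 125] both times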

--
Terry Jan Reedy
