Let me answer you both since the issues are related.

On 1/14/2014 7:46 AM, Nick Coghlan wrote:

> Guido van Rossum writes:
>> And that is precisely my point. When you're using a format string,

Bytes interpolation uses a bytes format, or a byte string if you will, but it should not be thought of as a character or text string. Certain bytes (123 and 125) delimit a replacement field. In my version, the bytes in between define a format-spec, which is ascii-decoded to text for input to 3.x format(). The decoding, and the subsequent encoding of the result, would not be needed if 2.7's format(ob, byte-spec) were available.
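
For concreteness, a minimal sketch of that spec handling (the function name is mine, not code from my byteformat draft):

def format_field(value, spec_bytes):
    # Decode the bytes between 123 and 125 with ascii, hand the text
    # spec to 3.x format(), and ascii-encode the formatted text back.
    spec = spec_bytes.decode('ascii')
    return format(value, spec).encode('ascii')

print(format_field(3.14159, b'.3f'))   # b'3.142'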

>> all of the format string (not just the part between { and }) had
>> better use ASCII or an ASCII superset.

I am not even sure what you mean here. The bytes outside of the 123 ... 125 replacement fields are simply copied to the output bytes. There is no encoding or interpretation involved.

It is true that the uninterpreted bytes had best not contain a byte pattern mistakenly recognized as a replacement field. I plan to refine the regular expression byte pattern used in byteformat to sharply reduce the possibility of such errors. When such errors happen anyway, an exception should be raised, and I plan to expand the error message to give more diagnostic information.
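
As a sketch of the kind of check I have in mind (not the actual byteformat code):

import re

FIELD = re.compile(rb'\{[^{}]*\}')   # a replacement field: {spec-bytes}

def find_fields(fmt):
    # Any brace byte (123 or 125) left over after removing well-formed
    # fields is an error worth reporting with specifics.
    fields = FIELD.findall(fmt)
    leftover = FIELD.sub(b'', fmt)
    for meta in (123, 125):          # the bytes b'{' and b'}'
        if meta in leftover:
            raise ValueError('unmatched brace byte %d in %r' % (meta, fmt))
    return fields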

> And this (rightly) constrains the output to an ASCII superset as well.

What does this mean? I suspect I disagree. The bytes interpolated into the output bytes can be any bytes.

> Except that if you interpolate something like Shift JIS,

Bytes interpolation interpolates bytes, not encodings. A self-identifying byte stream starts with bytes in a known encoding that specify the encoding of the rest of the stream. Neither part need be encoded text. (Would that something like this were standard for encoded text streams, as well as for serialized images.)
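
A toy illustration of such a self-identifying stream (my example, nothing proposed in the thread):

def decode_stream(raw):
    # The header line is defined to be ascii and names the encoding of
    # everything after it; the body need not even be encoded text.
    header, _, body = raw.partition(b'\n')
    return body.decode(header.decode('ascii'))

raw = b'utf-8\n' + 'café'.encode('utf-8')
print(decode_stream(raw))   # café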

>> [snip]

> Right, that's the danger I was worried about, but the problem is that
> there's at least *some* minimum level of ASCII compatibility that
> needs to be assumed in order to define an interpolation format at all
> (this is the point I originally missed).

I would put this slightly differently. To process bytes, we may define certain bytes as metabytes with a special meaning. We may choose the bytes that happen to be the ascii encoding of certain characters. But once the special numbers are chosen, they are numbers, not characters.

The problem of metabytes having both a normal and special meaning is similar to the problem of metacharacters having both a normal and special meaning.

> For printf-style formatting,
> it's % along with the various formatting characters and other syntax
> (like digits, parentheses, variable names and "."), with the format
> method it's braces, brackets, colons, variable names, etc.

It is the bytes corresponding to these characters. This is true also of the metabytes in an re module bytes pattern.
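
For example (my illustration; byte 46 is the ascii code for '.', a regular expression metacharacter):

>>> import re
>>> re.findall(rb'.', bytes([1, 46, 200]))   # metabyte 46: matches any byte
[b'\x01', b'.', b'\xc8']
>>> re.findall(rb'\.', bytes([1, 46, 200]))  # escaped: matches only byte 46
[b'.']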

> The mini-language parser has to assume an encoding
> in order to interpret the format string,

This is where I disagree with you and Guido. Bytes processing is done with numbers 0 <= n <= 255, not characters. The fact that ascii characters can, for convenience, be used in bytes literals to indicate the corresponding ascii codes does not change this. A bytes parser looks for certain special numbers. Other numbers need not be given any interpretation and need not represent encoded characters.

> and that's *all* done assuming an ASCII compatible format string

Since any bytes can be regarded as an ascii-compatible latin-1 encoded string, that seems like a vacuous assumption. In any case, I do not see any particular assumption in the following, other than the choice of replacement field delimiters.

>>> list(byteformat(bytes([1, 2, 10, 123, 125, 200]),
...                 (bytes([50, 100, 150]),)))
[1, 2, 10, 50, 100, 150, 200]
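
For the curious, a minimal sketch that reproduces the session above (the draft byteformat is more elaborate, but the idea is the same):

import re

FIELD = re.compile(rb'\{([^{}]*)\}')

def byteformat(fmt, args):
    # Ordinary bytes are copied through uninterpreted; each field
    # consumes the next argument.  An empty spec inserts the argument's
    # bytes as-is; a non-empty spec is ascii-decoded, handed to 3.x
    # format(), and the formatted text ascii-encoded.
    out, pos, values = bytearray(), 0, iter(args)
    for m in FIELD.finditer(fmt):
        out += fmt[pos:m.start()]
        spec, value = m.group(1), next(values)
        if spec:
            out += format(value, spec.decode('ascii')).encode('ascii')
        else:
            out += value
        pos = m.end()
    out += fmt[pos:]
    return bytes(out)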

> (which must make life interesting if you try to use an
> ASCII incompatible coding cookie for your source code - I'm actually
> not sure what the full implications of that *are* for bytes literals
> in Python 3).

An interesting and important question. The Python 2 manual says that the coding cookie applies only to comments and string literals. To me, this suggests that any encoding can be used. I am not sure how and when the encoding is applied. It also suggests that the sequence of bytes resulting from a string literal is not determined solely by the sequence of characters comprising the literal, but also depends on the coding cookie.

The Python 3 manual says that the coding cookie applies to the whole source file. To me, this says that the subset of unicode chars included in the encoding *must* include the ascii characters. It also suggests to me that the encoding must be ascii-compatible, so that the ascii-text coding cookie itself can be read (unless there is a fallback to the system encoding).

In any case, a 3.x source file is decoded to unicode. When the sequence of unicode chars comprising a bytes literal is interpreted, the resulting sequence of bytes depends only on the literal and not the file encoding. So list(b'{}'), for instance, should always be [123, 125] in 3.x. My comments above about byte processing assume that this is so.
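
A quick check of that claim (a sketch; 3.x compile() accepts bytes source and honors the coding cookie):

for cookie in ('utf-8', 'latin-1'):
    src = ("# -*- coding: %s -*-\nresult = list(b'{}')\n"
           % cookie).encode(cookie)
    ns = {}
    exec(compile(src, '<test>', 'exec'), ns)
    print(cookie, ns['result'])    # [123, 125] both times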

--
Terry Jan Reedy
