[Python-ideas] Re: prefix/suffix for bytes (was: New explicit methods to trim strings)

Chris Angelico Wed, 11 Mar 2020 03:41:11 -0700

On Wed, Mar 11, 2020 at 9:28 PM Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Wed, Mar 11, 2020 at 07:28:06AM +1100, Chris Angelico wrote:
>
> > That's exactly what "ASCII compatible" means. Since ASCII is a
> > seven-bit encoding, an encoding is ASCII-compatible if (a) every ASCII
> > character is represented by the corresponding byte value, and (b)
> > every seven-bit value represents that ASCII character.
>
> Sorry Chris, that explanation left me more confused than I started :-(
>
> Let me have a go...
>
> The ASCII encoding is a mapping between *seven-bit numeric values* and
> 128 distinct characters, some of which are human-readable:
>
>     A = 1000001
>     B = 1000010
>     a = 1100001
>
> and some of which are considered to be "binary" characters:
>
>     NUL = 0000000
>     SOH = 0000001
>     DEL = 1111111


Correct.
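
A minimal Python sketch of those bit patterns, using only the built-ins
`ord` and `format`. The helper name `ascii_bits` is invented here for
illustration:

```python
def ascii_bits(ch: str) -> str:
    """Return the seven-bit ASCII value of a single character as a bit string."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")  # zero-padded to exactly seven bits


# The six characters from the tables above.
for name, ch in [("A", "A"), ("B", "B"), ("a", "a"),
                 ("NUL", "\x00"), ("SOH", "\x01"), ("DEL", "\x7f")]:
    print(f"{name} = {ascii_bits(ch)}")
```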

> In practice today, seven bits are inconvenient, so these are always
> padded with a leading 0 bit.

Yes. There's no practical way to store ASCII characters in seven-bit
units, so we store those numbers in eight-bit bytes.
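
To see the padding, here is a small sketch that encodes text with Python's
"ascii" codec and prints each byte as eight bits. The helper name
`padded_bits` is made up for this example:

```python
def padded_bits(text: str) -> list[str]:
    """Encode ASCII text and show each stored byte as eight bits."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    return [format(byte, "08b") for byte in data]


# Every ASCII character occupies one byte whose leading bit is 0.
print(padded_bits("ABa"))
```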

> An encoding is compatible with ASCII if, and only if, the following is
> true:
>
> * all 128 of the ASCII characters are handled by the encoding;
>
> * each of those characters is mapped to the same eight-bit value as
>   the ASCII encoding would use (including the leading 0 bit);

Correct - this is my "(a)" condition.

> * no non-ASCII character is mapped to one of those eight-bit values;
>
> * or to something which could be confused with one of those eight-bit
>   values by a naive application that processed them a byte at a time.

And correct - this is my "(b)" condition. Any byte value below 128 must
represent the corresponding ASCII character, and nothing else.
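
The two conditions can be checked mechanically. What follows is only a
rough sketch, assuming a stateless codec that Python can look up by name.
The function name `is_ascii_compatible` and its `limit` parameter are
hypothetical, not anything from the standard library:

```python
def is_ascii_compatible(encoding: str, limit: int = 0x110000) -> bool:
    """Brute-force test of conditions (a) and (b) for a named codec."""
    # (a) every ASCII character encodes to the single byte of the same value
    for code in range(128):
        try:
            if chr(code).encode(encoding) != bytes([code]):
                return False
        except UnicodeEncodeError:
            return False

    # (b) no non-ASCII character produces a byte below 128
    for code in range(128, limit):
        if 0xD800 <= code <= 0xDFFF:
            continue  # surrogates are not encodable characters
        try:
            encoded = chr(code).encode(encoding)
        except UnicodeEncodeError:
            continue  # character is outside this encoding's repertoire
        if any(byte < 0x80 for byte in encoded):
            return False
    return True


for name in ("utf-8", "latin-1", "utf-16-le", "utf-32-le"):
    print(name, is_ascii_compatible(name))
```

The scan over every code point is slow but simple; passing a smaller
`limit` trades completeness for speed.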

> E.g. if an encoding mapped some character ∇ to the 16-bit value:
>
>     01000001 11110000
>
> that would not be considered ASCII-compatible, because the first byte
> would be interpreted as "A" by a naive application.

Exactly.
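
Here is a short sketch of that failure mode. The two bytes are the ones
from the example, and `naive_ascii_view` is an invented helper that stands
in for a byte-at-a-time application. As it happens, UTF-16-BE produces
exactly those two bytes for U+41F0, while UTF-8 encodes ∇ (U+2207) with no
byte below 128:

```python
def naive_ascii_view(data: bytes) -> str:
    """Read bytes one at a time, as a naive ASCII-only tool would."""
    return "".join(chr(b) if b < 0x80 else "?" for b in data)


# The two bytes from the example (a hypothetical encoding of U+2207).
hypothetical = bytes([0b01000001, 0b11110000])
print(naive_ascii_view(hypothetical))  # the first byte is read as "A"
print(b"A" in hypothetical)            # a byte-level search finds a false "A"

# A real codec with the same problem: UTF-16-BE of U+41F0 is these two bytes.
print("\u41f0".encode("utf-16-be") == hypothetical)

# UTF-8 encodes U+2207 using only bytes with the high bit set.
print(naive_ascii_view("\u2207".encode("utf-8")))
```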

> Most (all?) of the "extended ASCII" eight-bit encodings are ASCII
> compatible, because they use only bytes with a leading 1 for the
> non-ASCII characters.

Right. Those are ASCII-compatible, single-byte encodings: simple,
straightforward, and easy to work with. But, of course, each one is
limited to just 128 non-ASCII characters.

> UTF-8 is also ASCII compatible.
>
> UTF-16 and UTF-32 are *not* ASCII compatible.
>
> How did I go?

Nailed it. And explained it far more clearly than I did.
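
This matters for the bytes prefix/suffix question in the subject line. As
a sketch only (the helper `ascii_suffix_matches` and the sample filename
are invented for illustration), an ASCII suffix test on encoded bytes works
for the ASCII-compatible encodings and fails for UTF-16 and UTF-32:

```python
def ascii_suffix_matches(text: str, suffix: bytes, encoding: str) -> bool:
    """Encode text, then test for an ASCII suffix at the byte level."""
    return text.encode(encoding).endswith(suffix)


name = "\u00e9.txt"  # "é.txt": one non-ASCII character plus an ASCII suffix
for encoding in ("utf-8", "latin-1", "utf-16-le", "utf-32-le"):
    data = name.encode(encoding)
    print(f"{encoding:10} {data!r:50} {ascii_suffix_matches(name, b'.txt', encoding)}")
```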

ChrisA
