Re: imaplib: is this really so unwieldy?

Terry Reedy Tue, 25 May 2021 15:39:27 -0700

On 5/25/2021 1:25 PM, MRAB wrote:

On 2021-05-25 16:41, Dennis Lee Bieber wrote:

In Python 3, strings are UNICODE, using 1, 2, or 4 bytes PER CHARACTER

This is CPython 3.3+ specific. Before that, it depended on the OS. I believe MicroPython uses utf-8 for strings.

(I don't recall if there is a 3-byte version).


There isn't.  It would save space but cost time.

If your input bytes are all
7-bit ASCII, then they map directly to a 1-byte per character string.

If your input bytes all have the upper bit 0 and they are interpreted as encoding ASCII characters, then they map to overhead + 1 byte per char.


>>> import sys
>>> sys.getsizeof(b''.decode('ascii'))
49
>>> sys.getsizeof(b'a'.decode('ascii'))
50
>>> sys.getsizeof(11*b'a'.decode('ascii'))
60

If
they contain any 8-bit upper half character they may map into a 2-byte per character string.


See below.

In CPython 3.3+:

U+0000..U+00FF are stored in 1 byte.
U+0100..U+FFFF are stored in 2 bytes.
U+010000..U+10FFFF are stored in 4 bytes.

In CPython's Flexible String Representation all characters in a string are stored with the same number of bytes, depending on the largest codepoint.


>>> sys.getsizeof('\U00011111')
80
>>> sys.getsizeof('\U00011111'*2)
84
>>> sys.getsizeof('a\U00011111')
84
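
For anyone who wants to check those widths on their own build, here is a minimal sketch that measures how much sys.getsizeof grows per added character. The helper name bytes_per_char is made up for illustration, and what it reports is a CPython 3.3+ implementation detail, not a language guarantee; the absolute sizes shown above also differ between CPython versions and platforms, which is why the sketch only looks at differences.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate storage per character by differencing two sizes.

    Hypothetical helper: subtracting cancels the fixed per-object
    overhead, which varies between CPython versions and platforms.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

# One sample character from each range: ASCII, upper half of
# Latin-1, the rest of the BMP, and a supplementary-plane character.
for ch in ('a', '\xe9', '\u0100', '\U00011111'):
    print(f'U+{ord(ch):06X}: {bytes_per_char(ch)} byte(s) per character')
```

Mixing one wide character into an otherwise narrow string widens every character in it, which is what the 'a\U00011111' line above shows.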

Bytes in Python 3 are just a binary stream, which needs an encoding to produce characters.


Or any other Python object.

Using the wrong encoding (say ISO-Latin-1) when the data is really UTF-8 will result in garbage.


So does decoding bytes as text when the bytes encode something else,
such as an image ;-).
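
Both kinds of garbage are easy to reproduce. The sketch below is only an illustration: the sample text and the variable names are my own, and the PNG file signature stands in for "an image". Note that Latin-1 maps every byte value to a character, so decoding with it never raises, whatever the bytes really are.

```python
text = 'caf\xe9'                     # 'cafe' with an acute accent on the e
data = text.encode('utf-8')          # b'caf\xc3\xa9'

# Wrong codec: Latin-1 accepts every byte, so this "succeeds" but
# turns the two-byte UTF-8 sequence into two separate characters.
garbled = data.decode('latin-1')
print(ascii(garbled))                # 'caf\xc3\xa9'

# Right codec: the text round-trips.
assert data.decode('utf-8') == text

# Bytes that are not text at all: the 8-byte PNG file signature.
png_signature = b'\x89PNG\r\n\x1a\n'
try:
    png_signature.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                  # 0x89 is not a valid UTF-8 start byte

# Latin-1 decodes it without complaint, but the result means nothing.
as_latin1 = png_signature.decode('latin-1')
```

A stricter codec such as UTF-8 at least has a chance of failing loudly on non-text input; Latin-1 will quietly hand back a string every time.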


--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list
