On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:

> On 13/6/2013 3:13 AM, Steven D'Aprano wrote:
>> Open an interactive Python session, and run this code:
>>
>> c = ord(16474)
>> len(c.encode('utf-8'))
>>
>>
>> That will tell you how many bytes are used for that example.
> This is actually wrong.
>
> ord()'s arguments must be a character for which we expect its ordinal
> value.

Gah! That's twice I've screwed that up. Sorry about that!

> >>> chr(16474)
> '䁚'
>
> Some Chinese symbol.
> So code-point '䁚' has a Unicode ordinal value of 16474, correct?

Correct.

> where in after encoding this glyph's ordinal value to binary gives us
> the following bytes:
>
> >>> bin(16474).encode('utf-8')
> b'0b100000001011010'

No! That creates a string from 16474 in base two:

'0b100000001011010'

The leading 0b is just syntax to tell you "this is base 2, not base 8
(0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.

Then you encode the string '0b100000001011010' into UTF-8. There are 17
characters in this string, and they are all ASCII characters, so they
take up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII
form). In hex form, they are:

b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'

which takes up a lot more room, which is why Python prefers to show
ASCII characters as characters rather than as hex.

What you want is:

chr(16474).encode('utf-8')

[...]

> Thus, there we count 15 bits left.
> So it says 15 bits, which is 1-bit less than 2 bytes. Is the above
> statements correct please?

No. There are 17 BYTES there. The string "0" doesn't get turned into a
single bit. It still takes up a full byte, 0x30, which is 8 bits.

> but thinking this through more and more:
>
> >>> chr(16474).encode('utf-8')
> b'\xe4\x81\x9a'
> >>> len(b'\xe4\x81\x9a')
> 3
>
> it seems that the bytestring the encode process produces is of length 3.

Correct! Now you have got the right idea.

-- 
Steven
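
To put the two results discussed above side by side, here is a minimal sketch for Python 3. The variable names are only illustrative and do not come from the original exchange. It shows that bin() produces a string of ASCII digits (one byte per character once encoded), while chr() produces the actual character, whose UTF-8 encoding is what you want to measure.

```python
code_point = 16474

# bin() returns a *string* of ASCII characters, not raw bits.
# Encoding that string gives one byte per character, '0b' prefix included.
as_text = bin(code_point)
text_bytes = as_text.encode('utf-8')

# chr() returns the character for the code point; encoding it gives
# the UTF-8 byte sequence for that character.
char = chr(code_point)
char_bytes = char.encode('utf-8')

print(as_text, len(text_bytes))      # the base-two string and its byte count
print(char_bytes, len(char_bytes))   # the UTF-8 bytes and their count

# int.bit_length() counts the significant bits of the number itself,
# which is a different question from how many bytes UTF-8 uses.
print(code_point.bit_length())

# ord() is the inverse of chr(): it takes a character, not an integer.
print(ord(char) == code_point)
```

The first print reports 17 bytes for the base-two string, the second reports 3 bytes for the encoded character, and bit_length() reports the 15 significant bits counted in the quoted message.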