On 14/6/2013 1:46 AM, Dennis Lee Bieber wrote:
On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), ???????? ??????
<supp...@superhost.gr> declaimed the following:
(*) in fact UTF-8 also indicates the end of each character
Up to a point. The initial byte encodes the length in its top few
bits, but the subsequent octets aren't distinguishable as final in
isolation. 0x80-0xBF can all be either medial or final.
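A quick Python 3 sketch of that last point (the sample characters are
just illustrative picks): the same 0x80-0xBF range appears both in the
middle and at the end of multi-byte sequences, so a lone continuation
byte does not say where a character ends.

two_byte = 'é'.encode('utf-8')        # b'\xc3\xa9'  -> final byte 0xa9
three_byte = '€'.encode('utf-8')      # b'\xe2\x82\xac' -> medial 0x82, final 0xac

for b in two_byte + three_byte:
    is_continuation = 0x80 <= b <= 0xBF
    print(hex(b), "continuation" if is_continuation else "lead byte")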
So, the first high bits are a directive that UTF-8 uses to know how many
bytes each character is represented as.
Code points 0-127 (characters) use 1 bit to signify they need 1 byte of
storage and the remaining 7 bits to actually store the character?
Not quite... The leading bit is a 0 -> which means 0..127 are sent
as-is, no manipulation.
So, in UTF-8, the leading bit, which is a zero, is actually a flag to
tell that the code point needs 1 byte to be stored, and the remaining 7
bits are for the actual value of code points 0-127?
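A short Python 3 check of that single-byte case (the sample characters
are arbitrary): code points 0-127 come out of UTF-8 as one byte whose
top bit is 0.

for ch in ('A', '0', '~'):
    encoded = ch.encode('utf-8')
    print(ch, encoded, format(encoded[0], '08b'))   # e.g. A b'A' 01000001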
Code points 128-255 (characters) use 2 bits to signify they need 2 bytes
of storage and the remaining 14 bits to actually store the character?
128..255 -- in what encoding? These all have the leading bit with a
value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
inherent in the specified encoding and they are sent as-is.
So, in latin-iso or greek-iso the leading bit is not a flag like it is in
UTF-8 encoding, because latin-iso, greek-iso and all the *-iso encodings
use all 8 bits for storage?
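That can be seen directly in Python 3, taking ISO-8859-7 as the greek-iso
codec (the letter is just an example): every character is exactly one
byte and all 8 bits are data, so a high top bit is not a flag there.

s = 'α'                                # GREEK SMALL LETTER ALPHA, U+03B1
greek = s.encode('iso-8859-7')         # one byte, 0xe1
print(greek, format(greek[0], '08b'))  # b'\xe1' 11100001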
But in UTF-8 a leading bit of 1 is there to tell that the code point
needs 2 bytes to be stored, and the remaining 7 bits are for the actual
value of code points 128-255?
But why 2 bytes? A leading 1 is a flag, and the remaining 7 bits could
hold the encoded value.
But that is not the case, since we know that UTF-8 needs 2 bytes to
store code points 128-255.
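A small Python 3 demonstration of that two-byte cost (chr(255) is just
one convenient pick from the 128-255 range): the lead byte starts with
110, the continuation with 10, so only 11 bits are left for the value.

ch = chr(255)                            # U+00FF, the top of the 128-255 range
utf8 = ch.encode('utf-8')                # b'\xc3\xbf'
print([format(b, '08b') for b in utf8])  # ['11000011', '10111111']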
1110 starts a three byte sequence, 11110 starts a four byte sequence...
Basically, count the number of leading 1-bits before a 0 bit, and that
tells you how many bytes are in the multi-byte sequence -- and all bytes
that start with 10 are supposed to be the continuations of a multibyte set
(and not a signal that this is a 1-byte entry -- those only have a leading
0)
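Here is a rough Python 3 sketch of that counting rule (the helper name
utf8_length and the sample characters are made up for illustration):
count the leading 1-bits of the first byte to get the sequence length,
and treat 10xxxxxx as a continuation rather than the start of anything.

def utf8_length(lead_byte):
    if lead_byte < 0x80:                 # 0xxxxxxx -> a 1-byte character
        return 1
    if lead_byte < 0xC0:                 # 10xxxxxx -> continuation, not a lead byte
        raise ValueError("continuation byte, not the start of a character")
    length = 0
    mask = 0x80
    while lead_byte & mask:              # count the leading 1-bits
        length += 1
        mask >>= 1
    return length

for ch in ('A', 'α', '€', '\U0001F600'):
    data = ch.encode('utf-8')
    print(ch, len(data), utf8_length(data[0]))   # the two counts agree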
Why doesn't it work like this?
leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag
Wouldn't it be more logical?
Original UTF-8 allowed for 31 bits to specify a character in the Unicode
set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
continuation.
UTF-8 with 6 bytes = 48 bits - 7 bits (from the first byte) - 2 bits (for
each continuation) * 5 = 48 - 7 - 10 = 31 bits indeed to store the actual
code point. But 2^31 is still a huge number, big enough to store any kind
of character, isn't it?
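The same arithmetic can be redone for every length of the original
6-byte design described above with a few lines of Python 3 (just a
sanity check, not part of any library):

for nbytes in range(1, 7):
    flag_bits = 1 if nbytes == 1 else nbytes + 1       # n leading 1s plus a 0 in the lead byte
    payload = 8 * nbytes - flag_bits - 2 * (nbytes - 1)  # 2 flag bits per continuation
    print(nbytes, "bytes ->", payload, "payload bits")
# 6 bytes -> 31 payload bits, i.e. room for 2**31 code points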
--
What is now proved was at first only imagined!
--
http://mail.python.org/mailman/listinfo/python-list