On 14/6/2013 1:46 AM, Dennis Lee Bieber wrote:
On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), ???????? ??????
<supp...@superhost.gr> declaimed the following:
(*) in fact UTF-8 also indicates the end of each character
Up to a point. The initial byte encodes the length in its top few
bits, but the subsequent octets aren't distinguishable as final in
isolation. 0x80-0xBF can all be either medial or final.
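A quick Python 3 sketch of that last point (the sample characters are
just illustrative picks): the same 0x80-0xBF range appears both in the
middle and at the end of multi-byte sequences, so a lone continuation
byte does not say where a character ends.

two_byte = 'é'.encode('utf-8')        # b'\xc3\xa9'  -> final byte 0xa9
three_byte = '€'.encode('utf-8')      # b'\xe2\x82\xac' -> medial 0x82, final 0xac

for b in two_byte + three_byte:
    is_continuation = 0x80 <= b <= 0xBF
    print(hex(b), "continuation" if is_continuation else "lead byte")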
So, the first high bits are a directive that UTF-8 uses to know how many
bytes each character is represented as.
Code points 0-127 (characters) use 1 bit to signify they need 1 byte of
storage and the remaining 7 bits to actually store the character?
Not quite... The leading bit is a 0 -> which means 0..127 are sent
as-is, no manipulation.
So, in UTF-8, the leading bit, which is a zero, is actually a flag to
tell that the code point needs 1 byte to be stored, and the remaining 7
bits are for the actual value of code points 0-127?
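A short Python 3 check of that single-byte case (the sample characters
are arbitrary): code points 0-127 come out of UTF-8 as one byte whose
top bit is 0.

for ch in ('A', '0', '~'):
    encoded = ch.encode('utf-8')
    print(ch, encoded, format(encoded[0], '08b'))   # e.g. A b'A' 01000001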
Code points 128-255 (characters) use 2 bits to signify they need 2 bytes
of storage and the remaining 14 bits to actually store the character?
128..255 -- in what encoding? These all have the leading bit with a
value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
inherent in the specified encoding and they are sent as-is.
So, in latin-iso or greek-iso the leading bit is not a flag like it is in
UTF-8 encoding, because latin-iso, greek-iso and all the *-iso encodings
use all 8 bits for storage?
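That can be seen directly in Python 3, taking ISO-8859-7 as the greek-iso
codec (the letter is just an example): every character is exactly one
byte and all 8 bits are data, so a high top bit is not a flag there.

s = 'α'                                # GREEK SMALL LETTER ALPHA, U+03B1
greek = s.encode('iso-8859-7')         # one byte, 0xe1
print(greek, format(greek[0], '08b'))  # b'\xe1' 11100001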
But in UTF-8 a leading bit of 1 is there to tell that the code point
needs 2 bytes to be stored, and the remaining 7 bits are for the actual
value of code points 128-255?
But why 2 bytes? A leading 1 is a flag, and the remaining 7 bits could
hold the encoded value.
But that is not the case, since we know that UTF-8 needs 2 bytes to
store code points 128-255.
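A small Python 3 demonstration of that two-byte cost (chr(255) is just
one convenient pick from the 128-255 range): the lead byte starts with
110, the continuation with 10, so only 11 bits are left for the value.

ch = chr(255)                            # U+00FF, the top of the 128-255 range
utf8 = ch.encode('utf-8')                # b'\xc3\xbf'
print([format(b, '08b') for b in utf8])  # ['11000011', '10111111']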
1110 starts a three byte sequence, 11110 starts a four byte sequence...
Basically, count the number of leading 1-bits before a 0 bit, and that
tells you how many bytes are in the multi-byte sequence -- and all bytes
that start with 10 are supposed to be the continuations of a multibyte set
(and not a signal that this is a 1-byte entry -- those only have a leading
0)
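Here is a rough Python 3 sketch of that counting rule (the helper name
utf8_length and the sample characters are made up for illustration):
count the leading 1-bits of the first byte to get the sequence length,
and treat 10xxxxxx as a continuation rather than the start of anything.

def utf8_length(lead_byte):
    if lead_byte < 0x80:                 # 0xxxxxxx -> a 1-byte character
        return 1
    if lead_byte < 0xC0:                 # 10xxxxxx -> continuation, not a lead byte
        raise ValueError("continuation byte, not the start of a character")
    length = 0
    mask = 0x80
    while lead_byte & mask:              # count the leading 1-bits
        length += 1
        mask >>= 1
    return length

for ch in ('A', 'α', '€', '\U0001F600'):
    data = ch.encode('utf-8')
    print(ch, len(data), utf8_length(data[0]))   # the two counts agree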
Why doesn't it work like this?
leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag
Wouldn't it be more logical?
Original UTF-8 allowed for 31 bits to specify a character in the Unicode
set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
continuation.
UTF-8 with 6 bytes = 48 bits - 7 bits (from the first byte) - 2 bits (for
each continuation) * 5 = 48 - 7 - 10 = 31 bits indeed to store the actual
code point. But 2^31 is still a huge number, big enough to store any kind
of character, isn't it?
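The same arithmetic can be redone for every length of the original
6-byte design described above with a few lines of Python 3 (just a
sanity check, not part of any library):

for nbytes in range(1, 7):
    flag_bits = 1 if nbytes == 1 else nbytes + 1       # n leading 1s plus a 0 in the lead byte
    payload = 8 * nbytes - flag_bits - 2 * (nbytes - 1)  # 2 flag bits per continuation
    print(nbytes, "bytes ->", payload, "payload bits")
# 6 bytes -> 31 payload bits, i.e. room for 2**31 code points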
--
What is now proved was at first only imagined!
--
http://mail.python.org/mailman/listinfo/python-list