On 27.12.21 15:23, Adam D Ruppe wrote:
Let's look at:

"Hello 😂\n";
[...]
Finally, there's "string", which is utf-8, meaning each element is 8 bits, but again, there is a buffer you need to build up to get the code points you feed into that VM.
[...]
H, e, l, l, o, <space>, <next point is combined by these bits PLUS THREE MORE elements>, <this is a work-in-progress element and needs two more>, <this is a work-in-progress element and needs one more>, <this is the final work-in-progress element>, <new line>
[...]
Notice how each element here told you how many elements are left. This is encoded into the bit pattern and is part of why it took 4 elements instead of just three; there's some error-checking redundancy in there. This is a nice part of the design allowing you to validate a utf-8 stream more reliably and even recover if you jumped somewhere in the middle of a multi-byte sequence.

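It's actually just the first byte that tells you how many bytes are in the sequence. The continuation bytes don't carry any redundancy for that: each one is only the marker bits 10 followed by six payload bits, so it tells you it is a continuation but not how many more follow.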
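
To make that concrete, here is a rough sketch in Python (not D, just for brevity). The function name utf8_sequence_length is made up for this post, not taken from any library. It looks only at the first byte to decide how long the sequence is, and reports 0 for a byte that cannot start a sequence:

```python
def utf8_sequence_length(lead: int) -> int:
    """Total length of the UTF-8 sequence starting with `lead`.

    Hypothetical helper for illustration. Returns 0 if `lead` cannot
    start a sequence (a continuation byte or an invalid value).
    """
    if lead < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two bytes (0xC0/0xC1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four bytes, up to U+10FFFF
        return 4
    return 0                   # 10xxxxxx continuation byte, or invalid


data = "Hello 😂\n".encode("utf-8")

# The emoji starts at byte index 6, right after "Hello ".
print([hex(b) for b in data[6:10]])   # the four bytes of the emoji
print(utf8_sequence_length(data[6]))  # first byte: says 4
print(utf8_sequence_length(data[7]))  # continuation byte: says nothing (0)
```

Only data[6] yields a length; the three bytes after it are recognizable as continuations, but none of them says how many are left.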

To recover from the middle of a sequence, you just skip the orphaned continuation bytes one at a time.
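
A minimal sketch of that skipping, again in Python and with made-up names (is_continuation and resync are mine, not a library API). It assumes you have landed at an arbitrary byte offset and just want the next position where a sequence can start:

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Skip orphaned continuation bytes one at a time, starting at `pos`.

    Hypothetical helper for illustration. Returns the index of the next
    byte that is not a continuation byte, or len(data) if there is none.
    """
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "Hello 😂\n".encode("utf-8")

# Index 8 is in the middle of the four-byte emoji (indices 6 to 9).
start = resync(data, 8)
print(start)                         # index of the newline after the emoji
print(data[start:].decode("utf-8"))  # decodes cleanly from there
```

Note that this only finds the next possible start; the code point you jumped into the middle of is lost, and the byte you land on still has to be validated as a proper first byte.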
