On 27.12.21 15:23, Adam D Ruppe wrote:
> Let's look at:
> "Hello 😂\n";
> [...]
> Finally, there's "string", which is utf-8, meaning each element is 8
> bits, but again, there is a buffer you need to build up to get the code
> points you feed into that VM.
> [...]
> H, e, l, l, o, <space>, <next point is combined by these bits PLUS THREE
> MORE elements>, <this is a work-in-progress element and needs two more>,
> <this is a work-in-progress element and needs one more>, <this is the
> final work-in-progress element>, <new line>
> [...]
> Notice how each element here told you how many elements are left. This
> is encoded into the bit pattern and is part of why it took 4 elements
> instead of just three; there's some error-checking redundancy in there.
> This is a nice part of the design allowing you to validate a utf-8
> stream more reliably and even recover if you jumped somewhere in the
> middle of a multi-byte sequence.

It's actually just the first byte that tells you how many are in the
sequence. The continuation bytes don't have redundancies for that.
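
To make that concrete, here is a minimal sketch of the bit patterns involved. The names seqLength and isContinuation are made up for this post and are not Phobos functions (std.utf.stride is the library routine for this kind of query). The sketch only looks at the prefix bits; it does not reject overlong or out-of-range lead bytes such as 0xC0 or 0xF5.

```d
import std.stdio : writefln;
import std.string : representation;

// Hypothetical helper: length of the sequence announced by a lead byte,
// or 0 if the byte cannot start a sequence. Checks prefix bits only.
size_t seqLength(ubyte b) pure nothrow @nogc @safe
{
    if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                         // 10xxxxxx or 11111xxx
}

// Hypothetical helper: a continuation byte is 10xxxxxx. The low six bits
// are payload; nothing in the byte says how many more bytes follow.
bool isContinuation(ubyte b) pure nothrow @nogc @safe
{
    return (b & 0xC0) == 0x80;
}

void main()
{
    // Print each code unit with what its bit pattern says about it.
    foreach (i, b; "Hello 😂\n".representation)
        writefln("%2d: 0x%02X len=%d cont=%s",
                 i, b, seqLength(b), isContinuation(b));
}
```

For the emoji (bytes F0 9F 98 82), only F0 reports a length of 4; the three bytes after it all carry the same 10 prefix and report nothing about their position.
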
To recover from the middle of a sequence, you just skip the orphaned
continuation bytes one at a time.
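
A rough sketch of that recovery step, assuming you are handed a byte slice and an arbitrary index into it. The function name resync is hypothetical, and the sketch does no validation beyond testing for the 10xxxxxx prefix.

```d
import std.string : representation;

// Hypothetical helper: starting at pos, skip orphaned continuation bytes
// (10xxxxxx) one at a time and return the index of the next byte that can
// start a sequence, or data.length if there is none.
size_t resync(const(ubyte)[] data, size_t pos) pure nothrow @nogc @safe
{
    while (pos < data.length && (data[pos] & 0xC0) == 0x80)
        ++pos;
    return pos;
}

void main()
{
    auto bytes = "Hello 😂\n".representation;

    // Index 7 is the second byte of the emoji (F0 9F 98 82 occupies 6..9).
    auto next = resync(bytes, 7);
    assert(next == 10);          // landed just past the emoji
    assert(bytes[next] == '\n'); // decoding can resume here
}
```

Because a sequence has at most three continuation bytes, the loop never needs to skip more than three bytes in well-formed input.
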