Re: Fix broken UTF-8 decoding

Crystal Kolipe Sat, 25 Feb 2023 13:09:56 -0800

On Sat, Feb 25, 2023 at 08:29:54PM +0100, Steffen Nurpmeso wrote:
> Crystal Kolipe wrote in
>  <Y/pgctbyhmrx+...@exoticsilicon.com>:
>  |Currently it is not possible to use unicode codepoints > 0xFF on the \
>  |console,
>  |because our UTF-8 decoding logic is badly broken.
>  |
>  |The code in question is in wsemul_subr.c, wsemul_getchar().
>  |
>  |The problem is that we calculate the number of bytes in a multi-byte
>  |sequence by just looking at the high bits in turn:
>  ...
>  |This is wrong, for several reasons.
> 
> Just to note there are also holes, UTF-8 sequences are not
> necessarily well-formed (per se -- maybe they are when you control
> their generation, of course).  It is actually a real mess

Well, I did allude to further issues in my original post:

Crystal Kolipe <kolip...@exoticsilicon.com> wrote:
> The UTF-8 decoder still needs more work done on it to reject invalid
> sequences such as overlong encodings and the UTF-16 surrogates.
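
To make it clearer what those extra checks involve, here is a rough sketch of
a strict decoder.  This is not the code in wsemul_subr.c and not part of the
patch; utf8_decode_strict() is a hypothetical name used only for illustration.
It takes the sequence length from the lead byte, then rejects overlong
encodings, the UTF-16 surrogate range and values above U+10FFFF:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration, not taken from the tree.
 * Decodes one UTF-8 sequence from buf[0..len-1] into *out.
 * Returns the number of bytes consumed, or 0 if the sequence is
 * truncated or invalid.
 */
size_t
utf8_decode_strict(const unsigned char *buf, size_t len, uint32_t *out)
{
	/* Smallest code point that legitimately needs N bytes. */
	static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
	uint32_t cp;
	size_t need, i;

	if (len == 0)
		return 0;

	if (buf[0] < 0x80) {
		*out = buf[0];
		return 1;
	} else if ((buf[0] & 0xe0) == 0xc0) {
		need = 2;
		cp = buf[0] & 0x1f;
	} else if ((buf[0] & 0xf0) == 0xe0) {
		need = 3;
		cp = buf[0] & 0x0f;
	} else if ((buf[0] & 0xf8) == 0xf0) {
		need = 4;
		cp = buf[0] & 0x07;
	} else
		return 0;	/* stray continuation byte, or 0xf8..0xff */

	if (len < need)
		return 0;	/* truncated sequence */

	for (i = 1; i < need; i++) {
		if ((buf[i] & 0xc0) != 0x80)
			return 0;	/* not a continuation byte */
		cp = (cp << 6) | (buf[i] & 0x3f);
	}

	if (cp < min_cp[need])
		return 0;	/* overlong encoding */
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0;	/* UTF-16 surrogate */
	if (cp > 0x10ffff)
		return 0;	/* beyond the Unicode range */

	*out = cp;
	return need;
}
```

With this sketch, the two-byte sequence c0 af (an overlong '/') and the
three-byte sequence ed a0 80 (the surrogate U+D800) are both rejected, while
e2 82 ac decodes to U+20AC.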

If people would rather wait to change this until I can fix the other issues in
the UTF-8 decoder, then fine.  But with what we've got in the tree at the
moment, it's not possible to use characters beyond 0xFF, so the only use that
we can make of Unicode is to display the existing 'extended ASCII' characters
using UTF-8 sequences instead of single 8-bit characters.
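
As a small illustration of why 0xFF is not a natural boundary in UTF-8 itself,
the sketch below encodes a code point into its byte sequence.  The function
name utf8_encode() is hypothetical and the code is not from the tree; it only
shows the standard encoding rules:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only.
 * Encodes code point cp as UTF-8 into out[], which must have room
 * for at least 4 bytes.  Returns the number of bytes written, or 0
 * if cp is a surrogate or lies beyond U+10FFFF.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xc0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		if (cp >= 0xd800 && cp <= 0xdfff)
			return 0;	/* UTF-16 surrogate */
		out[0] = (unsigned char)(0xe0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		out[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	if (cp <= 0x10ffff) {
		out[0] = (unsigned char)(0xf0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		out[3] = (unsigned char)(0x80 | (cp & 0x3f));
		return 4;
	}
	return 0;
}
```

U+00FF comes out as c3 bf and U+0100 as c4 80, so both sides of the 0xFF
boundary are ordinary two-byte sequences as far as the encoding is concerned.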

With the patch I provided, it should be possible to add glyphs for non-Latin
scripts, mathematical symbols, etc.

That in turn allows people to test userland applications with other character
sets, and to highlight and fix any issues in those applications.

And note that this doesn't add any bloat to the kernel, because all of the
functionality needed to do this is already in there (apart from the
relevant font, obviously).
