Re: Fix broken UTF-8 decoding

2023-03-06 Thread Miod Vallat
> Currently it is not possible to use unicode codepoints > 0xFF on the console,
> because our UTF-8 decoding logic is badly broken.
> 
> The code in question is in wsemul_subr.c, wsemul_getchar().
> 
> The problem is that we calculate the number of bytes in a multi-byte
> sequence by just looking at the high bits in turn:
> 
>   if (frag & 0x20) {
>   frag &= ~0x20;
>   mbleft++;
>   }
>   if (frag & 0x10) {
>   frag &= ~0x10;
>   mbleft++;
>   }
>   if (frag & 0x08) {
>   frag &= ~0x08;
>   mbleft++;
>   }
>   if (frag & 0x04) {
>   frag &= ~0x04;
>   mbleft++;
>   }
> 
> This is wrong, for several reasons.

Doh! Thanks for noticing this. I have replaced that code with something
much saner now.

Miod



Re: Fix broken UTF-8 decoding

2023-02-25 Thread Crystal Kolipe
On Sat, Feb 25, 2023 at 08:29:54PM +0100, Steffen Nurpmeso wrote:
> Crystal Kolipe wrote in
>  :
>  |Currently it is not possible to use unicode codepoints > 0xFF on the \
>  |console,
>  |because our UTF-8 decoding logic is badly broken.
>  |
>  |The code in question is in wsemul_subr.c, wsemul_getchar().
>  |
>  |The problem is that we calculate the number of bytes in a multi-byte
>  |sequence by just looking at the high bits in turn:
>  ...
>  |This is wrong, for several reasons.
> 
> Just to note there are also holes, UTF-8 sequences are not
> necessarily well-formed (per se -- maybe they are when you control
> their generation, of course).  It is actually a real mess

Well, I did elude to further issues in my original post:

Crystal Kolipe  wrote:
> The UTF-8 decoder still needs more work done on it to reject invalid
> sequences such as over long encodings and the UTF-16 surrogates.

If people would rather wait to change this until I can fix the other issues in
the UTF-8 decoder then fine, but with what we've got in the tree at the
moment, it's not possible to use characters beyond 0xFF.  So the only use that
we can make of unicode is to display the existing 'extended ASCII' characters
using UTF-8 sequences instead of single character 8-bit ones.

With the patch I provided, it should be possible to add glyphs for non-latin
scripts, mathematical symbols, etc.

That in turn allows people to test userland applications with other character
sets, etc, highlight and fix any issues in those applications.

And note that this doesn't add any bloat to the kernel, because all of the
functionality needed to do this is already in there, (apart from the
relevant font, obviously).



Re: Fix broken UTF-8 decoding

2023-02-25 Thread Steffen Nurpmeso
Crystal Kolipe wrote in
 :
 |Currently it is not possible to use unicode codepoints > 0xFF on the \
 |console,
 |because our UTF-8 decoding logic is badly broken.
 |
 |The code in question is in wsemul_subr.c, wsemul_getchar().
 |
 |The problem is that we calculate the number of bytes in a multi-byte
 |sequence by just looking at the high bits in turn:
 ...
 |This is wrong, for several reasons.

Just to note there are also holes, UTF-8 sequences are not
necessarily well-formed (per se -- maybe they are when you control
their generation, of course).  It is actually a real mess:

if(LIKELY(x <= 0x7Fu))
c = x;
/* 0xF8, but Unicode guarantees maximum of 0x10u -> F4 8F BF BF.
 * Unicode 9.0, 3.9, UTF-8, Table 3-7. Well-Formed UTF-8 Byte Sequences 
*/
else if(LIKELY(x > 0xC0u && x <= 0xF4u)){
if(LIKELY(x < 0xE0u)){
if(UNLIKELY(l < 1))
goto jenobuf;
--l;

c = (x &= 0x1Fu);
}else if(LIKELY(x < 0xF0u)){
if(UNLIKELY(l < 2))
goto jenobuf;
l -= 2;

x1 = x;
c = (x &= 0x0Fu);

/* Second byte constraints */
x = S(u8,*cp++);
switch(x1){
case 0xE0u:
if(UNLIKELY(x < 0xA0u || x > 0xBFu))
goto jerr;
break;
case 0xEDu:
if(UNLIKELY(x < 0x80u || x > 0x9Fu))
goto jerr;
break;
default:
if(UNLIKELY((x & 0xC0u) != 0x80u))
goto jerr;
break;
}
c <<= 6;
c |= (x &= 0x3Fu);
}else{
if(UNLIKELY(l < 3))
goto jenobuf;
l -= 3;

x1 = x;
c = (x &= 0x07u);

/* Third byte constraints */
x = S(u8,*cp++);
switch(x1){
case 0xF0u:
if(UNLIKELY(x < 0x90u || x > 0xBFu))
goto jerr;
break;
case 0xF4u:
if(UNLIKELY((x & 0xF0u) != 0x80u)) /* 80..8F */
goto jerr;
break;
default:
if(UNLIKELY((x & 0xC0u) != 0x80u))
goto jerr;
break;
}
c <<= 6;
c |= (x &= 0x3Fu);

x = S(u8,*cp++);
if(UNLIKELY((x & 0xC0u) != 0x80u))
goto jerr;
c <<= 6;
c |= (x &= 0x3Fu);
}

x = S(u8,*cp++);
if(UNLIKELY((x & 0xC0u) != 0x80u))
goto jerr;
c <<= 6;
c |= x & 0x3Fu;
}else
goto jerr;

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)