Multibyte conversion bugs and fixes

Neil Booth Fri, 25 Apr 2008 20:03:38 -0700

Hi!  Thanks for rxvt-unicode.

rxvt-unicode does not recover correctly from character set
conversion errors.  The C standard leaves the conversion state
unspecified after such an error; hence it is necessary to explicitly
reset to the initial conversion state to get well-defined behaviour
post-error.


On NetBSD, the C library does indeed leave the conversion state in
a bad state; in fact all subsequent attempts to convert with that bad
state object will fail.  It is necessary to reset state explicitly.
You can argue that this is poor on the part of the C library; I may
agree, and presumably glibc recovers better, but the fact is NetBSD's
behaviour is valid according to the C standard and rxvt-unicode is
incorrect in assuming it can continue.  You may even argue NetBSD's
behaviour is nice because it catches these coding assumptions.

Anyway, please consider applying the two attached patches for the next
release.  On NetBSD, having any bad character output to the terminal
renders that terminal session useless with rxvt-unicode in its
current form; I need to start a new session to read Japanese files
again.  As shown below, this can happen simply by viewing a file
encoded differently to what I expected.

How to reproduce this on NetBSD:

1) LC_CTYPE=ja_JP.UTF-8 urxvt
2) cat a UTF-8 file; looks good!
3) cat an EUC-JP file; mojibake only!
4) cat the same UTF-8 file as in 2); mojibake only!

With the patches below:

1) LC_CTYPE=ja_JP.UTF-8 urxvt
2) cat a UTF-8 file; looks good!
3) cat an EUC-JP file; mojibake only!
4) cat the same UTF-8 file as in 2); looks good!

If you require the two sample files I can provide them, but anything
with more than one or two Japanese characters should suffice.

Thanks!

Neil.

$NetBSD$

--- src/command.C.orig  2008-04-26 10:10:05.000000000 +0900
+++ src/command.C
@@ -2380,13 +2380,19 @@ rxvt_term::next_char () NOTHROW
 
       if (len == (size_t)-2)
         {
+         // Reset to initial conversion state from undefined
+         mbrtowc (0, 0, 0, mbstate);
           // the mbstate stores incomplete sequences. didn't know this :/
           cmdbuf_ptr = cmdbuf_endp;
           break;
         }
 
       if (len == (size_t)-1)
+       {
+         // Reset to initial conversion state from undefined
+         mbrtowc (0, 0, 0, mbstate);
         return (unsigned char)*cmdbuf_ptr++; // the _occasional_ latin1 character is allowed to slip through
+       }
 
       // assume wchar == unicode
       cmdbuf_ptr += len;

$NetBSD$

--- src/misc.C.orig     2008-04-26 10:10:56.000000000 +0900
+++ src/misc.C
@@ -40,7 +40,11 @@ rxvt_wcstombs (const wchar_t *str, int l
       ssize_t l = wcrtomb (dst, *str++, mbs);
 
       if (l < 0)
+      {
+       // Reset to initial conversion state from undefined
+       wcrtomb (0, 0, mbs);
         *dst++ = '?';
+      }
       else
         dst += l;
     }

_______________________________________________
rxvt-unicode mailing list
[email protected]
http://lists.schmorp.de/cgi-bin/mailman/listinfo/rxvt-unicode
