Re: Multibyte conversion bugs and fixes

Neil Booth Fri, 25 Apr 2008 22:59:14 -0700

Marc Lehmann wrote:-

> On Sat, Apr 26, 2008 at 10:34:10AM +0900, Neil Booth <[EMAIL PROTECTED]> 
> wrote:
> 
> Thanks for digging out this issue :) One minor nitpick, however:


No problem.  I encountered the same issue trying to get one of my
own apps to handle multibyte conversion properly.  In doing so I
found bugs in NetBSD too.  And having had the experience of requiring
a state reset on error, I realized that is probably what was missing
in rxvt-unicode when I had issues with mojibake with it.  I just
grepped the source looking for these functions, and got a bit
overzealous "fixing" them as you point out.

> > @@ -2380,13 +2380,19 @@ rxvt_term::next_char () NOTHROW
> >  
> >        if (len == (size_t)-2)
> >          {
> > +     // Reset to initial conversion state from undefined
> > +     mbrtowc (0, 0, 0, mbstate);
> 
> This would introduce a bug (it would eat characters at the end of the
> buffer). Are you sure netbsd requires this? If yes, then netbsd is buggy.

I'm not sure about your point about a buffer, I assume this is related
to the context in which rxvt uses this function.  I've not looked into
that, but you're right that we shouldn't be resetting state on a return
of -2; the characters were fine just there wasn't enough of them, and
the function remains in mid-multibyte-character state ready for the
remaining bytes.  My mistake.

> >        if (len == (size_t)-1)
> > +   {
> > +     // Reset to initial conversion state from undefined
> > +     mbrtowc (0, 0, 0, mbstate);
> 
> well spotted, this is a real bug! thanks for tracking it down.
> 
> (gnu/linux manpages are abysmally bad)

I agree this is a bug.  The C standard is not great either in this area...
 
> >        if (l < 0)
> > +      {
> > +   // Reset to initial conversion state from undefined
> > +   wcrtomb (0, 0, mbs);
> 
> 
> this is a bug, too!

Agreed.
 
> however, should not, in theory, in this case, the buffer returned by
> wcrtomb added to the string (wcrtomb returns a reste shift state sequence
> in this case)?

Indeed, you're right, if I understand what you're saying the attached
revised patches are probably better.  I can confirm that that
rxvt-unicode with these revised patches applied also behaves
properly on NetBSD after cat-ting a file badly encoded for the current
locale.

> although, after na illegal sequence was encountered, it doesn't really
> matter in what state the conversion really is.
> 
> Both fixes are applied and will be part of the next release, which will
> include a fix for the select backend used on netbsd as well... :)

That's great, thanks!  Not sure what the select backend is though....

Neil.

$NetBSD$

--- src/command.C.orig  2008-04-26 14:51:29.000000000 +0900
+++ src/command.C
@@ -2386,7 +2386,11 @@ rxvt_term::next_char () NOTHROW
         }
 
       if (len == (size_t)-1)
+      {
+       // Return to well-defined initial shift state
+        mbrtowc (0, 0, 0, mbstate);
         return (unsigned char)*cmdbuf_ptr++; // the _occasional_ latin1 
character is allowed to slip through
+      }
 
       // assume wchar == unicode
       cmdbuf_ptr += len;

$NetBSD$

--- src/misc.C.orig     2008-04-26 14:51:36.000000000 +0900
+++ src/misc.C
@@ -39,10 +39,12 @@ rxvt_wcstombs (const wchar_t *str, int l
     {
       ssize_t l = wcrtomb (dst, *str++, mbs);
 
+      // On encoding error, return to initial shift state for both
+      // our buffer and wcrtomb's internal state.
       if (l < 0)
-        *dst++ = '?';
-      else
-        dst += l;
+       l = wcrtomb (dst, 0, mbs);
+
+      dst += l;
     }
 
   *dst++ = 0;

_______________________________________________
rxvt-unicode mailing list
[email protected]
http://lists.schmorp.de/cgi-bin/mailman/listinfo/rxvt-unicode

Re: Multibyte conversion bugs and fixes

Reply via email to