Re: readdir() returns inaccessible name if file was created with invalid UTF-8

Corinna Vinschen via Cygwin Thu, 24 Jul 2025 11:39:23 -0700

On Jul 24 19:45, Thomas Wolff via Cygwin wrote:
> Am 24.07.2025 um 17:35 schrieb Corinna Vinschen:
> > Consider the following GB18030 string: 0x90 0x30 0x81 0x30
> > 
> > This string translates into a UTF-16 surrogate pair: 0xd800 0xdc00.
> > 
> > If you run a tweaked version of your test application from
> > https://cygwin.com/pipermail/cygwin/2025-July/258513.html:
> > 
> >    setlocale (LC_CTYPE, "zh_CN.gb18030");
> >    mb (0x90);
> >    mb (0x30);
> >    mb (0x81);
> >    mb (0x30);
> > 
> > Then the output is:
> > 
> >    90 -> 0000 : -2
> >    30 -> 0000 : -2
> >    81 -> 0000 : -2
> >    30 -> D800 : 0
> > 
> > However, if you notice this situation...
> > 
> >    if (ret_from_mbrtowc == 0 && codeset == gb18030
> >        && (wc & 0xfc00) == 0xd800)
> > 
> > ...you can just add a fake NUL byte:
> > 
> >      mbrtowc (&wc, "", 1, &mbstate);
> > 
> > If you do that, the above sequence becomes:
> > 
> >    90 -> 0000 : -2
> >    30 -> 0000 : -2
> >    81 -> 0000 : -2
> >    30 -> D800 : 0
> >    00 -> DC00 : 1
> > 
> > I hope this helps, if you didn't already handle GB18030 differently
> > in mintty.
> Oooff. No, I didn't. So that is already before 3.6.4 (and again 3.6.5),
> right?


Starting with 3.5.0 in fact.

> Thanks for the notice, I'll check and test your workaround.

No worries.  While I was testing the UTF-8 problem, I realized that
we have another strange encoding that we've been supporting for a
short while now.

GB18030 is tricky, because there's no such thing as a simple
mathematical conversion, as there is for UTF-8.  The 2nd and 4th bytes
may have position-dependent meaning and could just as well represent an
ASCII char.  You can't simply search backwards in a string either.
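
To illustrate the position-dependent bytes, here is a minimal sketch (not
Cygwin code; gb18030_seq_len is a hypothetical helper) that derives the
sequence length from the first two bytes.  It assumes the usual GB18030
byte ranges: a lead byte 0x81..0xFE followed by 0x30..0x39 starts a
four-byte sequence, while 0x40..0x7E or 0x80..0xFE as the second byte
makes it a two-byte sequence.  It does not validate the 3rd and 4th bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Returns the length (1, 2 or 4) of the GB18030 sequence starting at
   s[0], 0 if the bytes cannot start a valid sequence, or -1 if more
   input is needed to decide.  Only the first two bytes are inspected. */
static int
gb18030_seq_len (const unsigned char *s, size_t n)
{
  assert (s != NULL);
  if (n == 0)
    return -1;
  if (s[0] <= 0x7f)
    return 1;			/* plain ASCII */
  if (s[0] == 0x80 || s[0] == 0xff)
    return 0;			/* treated as invalid lead bytes in this sketch */
  if (n < 2)
    return -1;			/* lead byte alone is not enough */
  if (s[1] >= 0x30 && s[1] <= 0x39)
    return 4;			/* an ASCII digit here marks a four-byte sequence */
  if ((s[1] >= 0x40 && s[1] <= 0x7e) || (s[1] >= 0x80 && s[1] <= 0xfe))
    return 2;
  return 0;
}
```

The same byte value 0x30 is the ASCII digit '0' when it stands alone and
part of a four-byte sequence when it follows a lead byte, which is why
scanning backwards from an arbitrary position is unreliable.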

As I wrote, you need all 4 bytes to allow conversion into UTF-16, so
a workaround as above is, unfortunately, necessary.


Corinna

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
