Yes, I totally agree with that. That way, a UTF-8 filename would be
processed without errors even when the system default is a DBCS locale.
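
A minimal sketch of the MBCS-only scan suggested in the quoted message, in
portable C. The names `last_separator`, `sjis_is_lead_byte` and
`no_lead_bytes` are hypothetical, and the Shift JIS lead-byte ranges
(0x81-0x9F, 0xE0-0xFC) are hard-coded as an assumption standing in for
`IsDBCSLeadByte()`, which consults the active code page on Windows. This is
not the mingw-w64 implementation, only an illustration of the idea:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByte(), assuming Shift JIS (CP932):
   lead bytes are 0x81-0x9F and 0xE0-0xFC. */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* For UTF-8 no skipping is needed: every byte of a multi-byte sequence
   has the MSB set, so it can never equal '/' (0x2F) or '\\' (0x5C). */
static int no_lead_bytes(unsigned char c)
{
    (void)c;
    return 0;
}

/* Return a pointer to the last '/' or '\\' in path that is a real
   separator, or NULL if there is none. Trail bytes of double-byte
   characters are skipped, so they are never tested. */
static const char *last_separator(const char *path,
                                  int (*is_lead)(unsigned char))
{
    const unsigned char *p = (const unsigned char *)path;
    const char *last = NULL;

    assert(path != NULL && is_lead != NULL);

    while (*p != '\0') {
        if (is_lead(*p) && p[1] != '\0') {
            p += 2;            /* skip lead byte and trail byte together */
            continue;
        }
        if (*p == '/' || *p == '\\')
            last = (const char *)p;
        p++;
    }
    return last;
}
```

For the Shift JIS bytes 64 69 72 5C 83 5C ("dir\ソ"), calling
`last_separator()` with `sjis_is_lead_byte` points at offset 3, because the
second 5C is the trail byte of the double-byte character 83 5C. The sketch
assumes the predicate matches the actual encoding of the string; a UTF-8
string would be scanned with `no_lead_bytes`.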

On Sat, Mar 25, 2023 at 12:35, Alvin Wong <[email protected]> wrote:

> Can we just avoid converting to wide char at all and operate only in
> MBCS? IsDBCSLeadByte should be enough to allow these functions to skip
> any false matches on the second byte of double-byte chars. And it does
> not matter that IsDBCSLeadByte doesn't work with UTF-8, because the
> UTF-8 encoding already ensures that there will be no false matches with
> 7-bit ASCII chars (all bytes forming multi-byte chars have the MSB set,
> unlike some DBCS).
>
>
> On 23/3/2023 13:29, LIU Hao wrote:
> > On 2023-03-22 13:40, 傅继晗 wrote:
> >> Hello:
> >> There is no need to convert multi-byte characters to wide
> >> characters and then back from wide to multi-byte; just handle the
> >> multi-byte string directly, as __xpg_basename in GNU libc does:
> >> glibc/xpg_basename.c at master · lattera/glibc (github.com)
> >> <https://github.com/lattera/glibc/blob/master/stdlib/xpg_basename.c>
> >> Converting multi-byte characters to wide characters can produce
> >> garbled text if the encoding of the incoming filename does not
> >> match the system default code page returned by GetACP(). And
> >> environment variables do not work there.
> >
> > I was also thinking about reimplementation of these functions, but
> > there are things that we must take care of:
> >
> > 1. The conversion from multibyte character strings (MBCS) to wide
> >    character strings (WCS) is necessary, because non-leading bytes in
> >    some MBCS encodings may match the backslash `\` (U+005C).
> >
> > 2. Not only forward and backward slashes are path separators. The
> >    conventional Shift JIS encoding has the Yen symbol `¥` (U+00A5)
> >    take the place of the ASCII backslash, and a Yen symbol is
> >    displayed for the byte 0x5C if a Shift JIS locale is activated.
> >    This means that `mbstowcs()` will convert the string "a¥b"
> >    (hex: 61 5C 62) to L"a¥b" (hex: 0061 00A5 0062) and we have to
> >    also accept `¥` as a path separator in Japanese locales.
> >
> > 3. Something similar to 2 happens with the Won symbol `₩` (U+20A9) in
> >    Korean locales, so we have to accept it, too.
> >
> > 4. Don't ever try to modify the global locale, due to thread safety.
> >
> >
> > OK, it has been too much. How do we implement this correctly? First we
> > need to make an assumption about the input, for example, let's assume
> > it is a path in Shift JIS encoding.
> >
> > 1. Don't check wide characters for path separators! As explained
> >    above, slashes are not the only path separators. Japanese and
> >    Korean are what I happen to know, but there could be a couple
> >    more. We notice that the byte 5C is always a path separator, no
> >    matter what it may map to, `\`, `¥` or `₩`. Hence, only the
> >    original MBCS should be scanned for path separators, which are
> >    `/` (U+002F) and `\` (U+005C), but nothing else.
> >
> > 2. It's necessary to convert it to a WCS first. Ideally the caller
> >    should have set the global locale... but are they aware of it?
> >    Maybe this should be done via `MultiByteToWideChar()`, like other
> >    functions.
> >
> > 3. So what is the conversion for? We can't make use of the output of
> >    such a conversion, but it tells us how many bytes a character
> >    takes in the original MBCS, so we know how many bytes to move
> >    forward (a multi-byte character is never a path separator), and
> >    will not mistake a non-leading byte for a path separator.
> >
> >
> >
> >
> >
> > _______________________________________________
> > Mingw-w64-public mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
>

