Can we just avoid converting to wide char at all and operate only in MBCS? IsDBCSLeadByte should be enough to allow these functions to skip any false matches on the second byte of double-byte chars. And it does not matter that IsDBCSLeadByte doesn't work with UTF-8, because the UTF-8 encoding already ensures that there will be no false matches with 7-bit ASCII chars (all bytes forming multi-byte chars have the MSB set, unlike some DBCS).
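
A minimal sketch of that idea in portable C, assuming code page 932 (Shift JIS) as the DBCS example. `is_lead_byte_cp932` is a hypothetical stand-in for `IsDBCSLeadByteEx(932, b)` so that the sketch builds without the Windows headers, and `last_separator_mbcs` is an illustrative name, not an existing mingw-w64 function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByteEx(932, b): CP932 lead bytes
   are 0x81..0x9F and 0xE0..0xFC. */
static int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Returns a pointer to the last '/' or '\' that is a character of its
   own, or NULL if there is none. `is_lead` may be NULL for encodings such
   as UTF-8, where every byte of a multi-byte character has the MSB set. */
static const char *last_separator_mbcs(const char *path,
                                       int (*is_lead)(unsigned char))
{
    const char *last = NULL;
    const char *p;

    assert(path != NULL);
    for (p = path; *p != '\0'; ++p) {
        unsigned char b = (unsigned char)*p;

        if (is_lead != NULL && is_lead(b) && p[1] != '\0') {
            ++p;            /* skip the trail byte, whatever its value */
            continue;
        }
        if (b == '/' || b == '\\')
            last = p;
    }
    return last;
}
```

With the bytes 61 5C 83 5C (Shift JIS for `a\ソ`), passing `is_lead_byte_cp932` yields the separator at offset 1, while passing NULL would wrongly yield the trail byte at offset 3. For UTF-8 input, NULL is sufficient.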

On 23/3/2023 13:29, LIU Hao wrote:
On 2023-03-22 13:40, 傅继晗 wrote:
Hello:
There is no need to convert multi-byte characters to wide characters and then back from wide to multi-byte; just handle the multi-byte string directly, as GNU's __xpg_basename does: glibc/xpg_basename.c at master · lattera/glibc (github.com) <https://github.com/lattera/glibc/blob/master/stdlib/xpg_basename.c>. Converting multi-byte characters to wide characters can produce garbled text if the encoding of the incoming filename does not match the system default code page returned by GetACP(). And environment variables do not work there.

I was also thinking about reimplementation of these functions, but there are things that we must take care of:

1. The conversion from multibyte character strings (MBCS) to wide character
   strings (WCS) is necessary, because non-leading bytes in some MBCS encodings
   may match the backslash `\` (U+005C).

2. Not only forward and backward slashes are path separators. The conventional
   Shift JIS encoding has the Yen symbol `¥` (U+00A5) take the place of the
   ASCII backslash, and a Yen symbol is displayed for the byte 0x5C if a Shift
   JIS locale is activated. This means that `mbstowcs()` will convert the string
   "a¥b" (hex: 61 5C 62) to L"a¥b" (hex: 0061 00A5 0062) and we have to also
   accept `¥` as a path separator in Japanese locales.

3. Something similar to 2 happens with the Won symbol `₩` (U+20A9) in Korean
   locales, so we have to accept it, too.

4. Don't ever try to modify the global locale, due to thread safety.
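
To illustrate point 1, here is a small sketch of the false match. It relies on the fact that the katakana "ソ" is the byte pair 83 5C in Shift JIS; `naive_last_backslash` and `sjis_so_txt` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Shift JIS bytes for dir\ソ.txt: 64 69 72 5C 83 5C 2E 74 78 74.
   The trail byte of "ソ" has the same value as the ASCII backslash. */
static const char sjis_so_txt[] = "dir\\\x83\x5C.txt";

/* Naive byte search: offset of the last 0x5C byte, or -1 if none.
   It cannot tell a real separator from a trail byte. */
static long naive_last_backslash(const char *path)
{
    const char *p;

    assert(path != NULL);
    p = strrchr(path, '\\');
    return p != NULL ? (long)(p - path) : -1;
}
```

The real separator sits at offset 3, but the naive search reports offset 5, which is the second byte of "ソ". A basename computed from that offset would cut the character in half.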


OK, it has been too much. How do we implement this correctly? First we need to make an assumption about the input, for example, let's assume it is a path in Shift JIS encoding.

1. Don't check wide characters for path separators! As explained above, not only
   slashes are path separators. Japanese and Korean are what I happen to know,
   but there could be a couple more. We notice that the byte 5C is always a path
   separator, no matter what it may map to, `\`, `¥` or `₩`. Hence, only the
   original MBCS should be scanned for path separators, which are `/` (U+002F)
   and `\` (U+005C), but nothing else.

2. It's necessary to convert it to a WCS first. Ideally the caller should have
   set the global locale... but are they aware of it? Maybe this should be done
   via `MultiByteToWideChar()`, like other functions.

3. So what is the conversion for? We can't make use of the output of such
   conversion, but it gives information about how many bytes a character
   takes in the original MBCS, so we can know how many bytes to move forward (a
   multi-byte character is never a path separator), and will not mistake a
   non-leading byte as a path separator.
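
The three steps above can be sketched as follows, under the stated assumption that the input is Shift JIS. `char_len_cp932` is a hypothetical helper that hard-codes the CP932 lead-byte ranges (0x81..0x9F and 0xE0..0xFC); a real implementation would take the character lengths from the code page actually in use. `last_component` is also an illustrative name and does not strip trailing separators the way `basename()` must.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of the character starting at p,
   assuming CP932. A lead byte followed by NUL is treated as one byte. */
static size_t char_len_cp932(const unsigned char *p)
{
    int lead = (p[0] >= 0x81 && p[0] <= 0x9F) ||
               (p[0] >= 0xE0 && p[0] <= 0xFC);

    return (lead && p[1] != 0) ? 2 : 1;
}

/* Returns a pointer to the start of the final path component. Only
   single-byte characters are compared against '/' (2F) and '\' (5C);
   multi-byte characters are stepped over as a whole. */
static const char *last_component(const char *path)
{
    const unsigned char *p = (const unsigned char *)path;
    const char *start = path;

    assert(path != NULL);
    while (*p != 0) {
        size_t n = char_len_cp932(p);

        if (n == 1 && (*p == 0x2F || *p == 0x5C))
            start = (const char *)p + 1;
        p += n;
    }
    return start;
}
```

Because the scan runs over the original bytes, it does not matter whether 5C would be displayed as `\`, `¥` or `₩`.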





_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public

