On 2023-03-22 13:40, 傅继晗 wrote:
Hello: There is no need to convert multi-byte characters to wide-byte characters and then convert from wide-byte back to multi-byte; just deal with multi-byte directly, as __xpg_basename in GNU does: glibc/xpg_basename.c at master · lattera/glibc (github.com) <https://github.com/lattera/glibc/blob/master/stdlib/xpg_basename.c> Converting multi-byte characters to wide-byte characters can lead to garbled text if the incoming filename encoding does not match the system default encoding returned by GetACP(). And environment variables do not work there.
I was also thinking about reimplementation of these functions, but there are things that we must take care of:
1. The conversion from multibyte character strings (MBCS) to wide character strings (WCS) is necessary, because non-leading bytes in some MBCS encodings may match the backslash `\` (U+005C).
2. Forward and backward slashes are not the only path separators. The conventional Shift JIS encoding has the Yen symbol `¥` (U+00A5) take the place of the ASCII backslash, and a Yen symbol is displayed for the byte 0x5C if a Shift JIS locale is activated. This means that `mbstowcs()` will convert the string "a¥b" (hex: 61 5C 62) to L"a¥b" (hex: 0061 00A5 0062), and we also have to accept `¥` as a path separator in Japanese locales.
3. Something similar to 2 happens with the Won symbol `₩` (U+20A9) in Korean locales, so we have to accept it, too.
4. Don't ever try to modify the global locale, due to thread safety.

OK, that has been a lot. How do we implement this correctly? First we need to make an assumption about the input; for example, let's assume it is a path in Shift JIS encoding.
1. Don't check wide characters for path separators! As explained above, slashes are not the only path separators. Japanese and Korean are the ones I happen to know, but there could be a couple more. We notice that the byte 5C, when it is a character on its own, is always a path separator, no matter what it may map to: `\`, `¥` or `₩`. Hence, only the original MBCS should be scanned for path separators, which are `/` (U+002F) and `\` (U+005C), but nothing else.
2. It's necessary to convert it to a WCS first. Ideally the caller should have set the global locale... but are they aware of it? Maybe this should be done via `MultiByteToWideChar()`, like other functions.
3. So what is the conversion for? We can't make use of the output of such a conversion, but it tells us how many bytes a character takes in the original MBCS, so we know how many bytes to move forward (a multibyte character is never a path separator) and will not mistake a non-leading byte for a path separator.

--
Best regards,
LIU Hao
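
To make the scanning steps above concrete, here is a rough sketch in C. It is only an illustration under stated assumptions, not a proposed patch. The input is assumed to be Shift JIS. The helper names `sjis_is_lead_byte()` and `sjis_last_component()` are made up for this example. The hard-coded lead-byte ranges (0x81-0x9F and 0xE0-0xFC) stand in for whatever the real conversion would report about character length; on Windows one would presumably query the code page instead, for example through `MultiByteToWideChar()` or `IsDBCSLeadByteEx()`. Trailing separators and drive letters are deliberately not handled.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if c starts a double-byte character.
 * The ranges are the usual Shift JIS lead-byte ranges and are an
 * assumption made for this sketch. */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: returns a pointer to the text that follows the
 * last path separator in a Shift JIS string, or the string itself if
 * there is no separator. Only the original bytes are inspected. */
static const char *sjis_last_component(const char *path)
{
    const unsigned char *p = (const unsigned char *)path;
    const char *base = path;

    while (*p != 0) {
        if (sjis_is_lead_byte(*p) && p[1] != 0) {
            /* A double-byte character is never a path separator, so
             * skip both bytes, even if the second one is 0x5C. */
            p += 2;
            continue;
        }
        if (*p == '/' || *p == '\\') {
            /* A single-byte 0x2F or 0x5C is a separator, whatever
             * glyph the locale displays for it. */
            base = (const char *)p + 1;
        }
        p++;
    }
    return base;
}
```

For example, the character 表 is encoded as the bytes 95 5C in Shift JIS. Given the bytes of `dir\表.txt`, the function above returns the bytes of `表.txt`, whereas a plain `strrchr(path, '\\')` would stop at the trail byte and yield `.txt`.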
_______________________________________________ Mingw-w64-public mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
