Thanks for your attention! I really understand it this time. But I still have a question: why doesn't dirname in glibc on Linux handle MBCS? Maybe they assume the input is UTF-8 by default?

On Thu, Mar 23, 2023 at 13:30, LIU Hao <[email protected]> wrote:
> On 2023-03-22 13:40, 傅继晗 wrote:
> > Hello:
> >
> > There is no need to convert multi-byte characters to wide characters and
> > then convert them back from wide to multi-byte; just deal with the
> > multi-byte string directly, as glibc's __xpg_basename does:
> >
> > glibc/xpg_basename.c at master · lattera/glibc (github.com)
> > <https://github.com/lattera/glibc/blob/master/stdlib/xpg_basename.c>
> >
> > Converting multi-byte characters to wide characters can lead to garbled
> > text if the encoding of the incoming filename does not match the system
> > default encoding returned by GetACP(). And environment variables do not
> > work there.
>
> I was also thinking about reimplementing these functions, but there are
> things that we must take care of:
>
> 1. The conversion from multibyte character strings (MBCS) to wide character
>    strings (WCS) is necessary, because non-leading bytes in some MBCS
>    encodings may match the backslash `\` (U+005C).
>
> 2. Forward and backward slashes are not the only path separators. The
>    conventional Shift JIS encoding has the Yen symbol `¥` (U+00A5) take the
>    place of the ASCII backslash, and a Yen symbol is displayed for the byte
>    0x5C if a Shift JIS locale is activated. This means that `mbstowcs()`
>    will convert the string "a¥b" (hex: 61 5C 62) to L"a¥b" (hex: 0061 00A5
>    0062), and we also have to accept `¥` as a path separator in Japanese
>    locales.
>
> 3. Something similar to 2 happens with the Won symbol `₩` (U+20A9) in
>    Korean locales, so we have to accept it, too.
>
> 4. Don't ever try to modify the global locale, due to thread safety.
>
>
> OK, that has been a lot. How do we implement this correctly? First we need
> to make an assumption about the input; for example, let's assume it is a
> path in Shift JIS encoding.
>
> 1. Don't check wide characters for path separators! As explained above,
>    slashes are not the only path separators. Japanese and Korean are what I
>    happen to know, but there could be a couple more. We notice that the
>    byte 5C is always a path separator, no matter what it may map to: `\`,
>    `¥` or `₩`. Hence, only the original MBCS should be scanned for path
>    separators, which are `/` (U+002F) and `\` (U+005C), but nothing else.
>
> 2. It's necessary to convert it to a WCS first. Ideally the caller should
>    have set the global locale... but are they aware of it? Maybe this
>    should be done via `MultiByteToWideChar()`, like other functions.
>
> 3. So what is the conversion for? We can't make use of the output of such a
>    conversion, but it tells us how many bytes each character takes in the
>    original MBCS, so we know how many bytes to move forward (a multi-byte
>    character is never a path separator) and will not mistake a non-leading
>    byte for a path separator.
>
>
>
> --
> Best regards,
> LIU Hao

_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
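
For reference, here is a minimal sketch of the scanning rule from the quoted message, under its own assumption that the input is a Shift JIS path. It is not the mingw-w64 implementation. The function names `sjis_is_lead_byte` and `sjis_last_separator` are made up for illustration, and the lead-byte ranges are hard-coded for Shift JIS (CP932) instead of being derived from a conversion to WCS as the quoted message proposes. On Windows, `IsDBCSLeadByteEx()` could presumably answer the same question for other code pages, but that is my assumption and not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if `b` starts a two-byte character in
 * Shift JIS (CP932). The ranges are hard-coded for this one encoding. */
static int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: return a pointer to the last path separator in
 * `path`, or NULL if there is none. Only the raw bytes 0x2F ('/') and
 * 0x5C ('\') count, and only when they are not the trail byte of a
 * two-byte character. */
static const char *sjis_last_separator(const char *path)
{
    const char *last = NULL;
    const char *p = path;

    assert(path != NULL);
    while (*p != '\0') {
        if (sjis_is_lead_byte((unsigned char)*p) && p[1] != '\0') {
            p += 2;  /* skip the lead byte and its trail byte together */
            continue;
        }
        if (*p == '/' || *p == '\\')
            last = p;
        p++;
    }
    return last;
}
```

With this, the byte sequence 83 5C (one two-byte character whose trail byte is 0x5C) contains no separator, while a standalone 5C is always treated as one, whether a given locale displays it as `\`, `¥` or `₩`.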
