Yes, I totally agree with that. That way, a UTF-8 filename would be processed without errors even when the system default is a DBCS locale.
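
To illustrate the byte-wise scan, here is a minimal sketch, not a proposed patch. The names `is_lead_byte_sjis` and `last_separator` are hypothetical. The former is a hard-coded stand-in for `IsDBCSLeadByte()` that assumes code page 932 (Shift JIS), so that the snippet also builds outside Windows; real code would call the API instead.

```c
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByte(): the lead-byte ranges of
 * Shift JIS (code page 932). This is an assumption made for illustration;
 * on Windows the real API would be called instead. */
static int is_lead_byte_sjis(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Return a pointer to the last path separator ('/' or '\\') in `path`,
 * or NULL if there is none. A lead byte and the byte that follows it are
 * skipped together, so a trail byte equal to 0x5C is never mistaken for
 * a separator. No conversion to wide characters takes place. */
static const char *last_separator(const char *path)
{
    const char *last = NULL;
    const unsigned char *p = (const unsigned char *)path;

    while (*p != '\0') {
        if (is_lead_byte_sjis(*p) && p[1] != '\0') {
            p += 2;             /* skip the whole double-byte character */
            continue;
        }
        if (*p == '/' || *p == '\\')
            last = (const char *)p;
        p++;
    }
    return last;
}
```

For example, for the bytes 61 5C 95 5C 62 the scan reports only the first 5C as a separator, because the second one is the trail byte of the double-byte character 95 5C.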

Alvin Wong <[email protected]> wrote on Sat, Mar 25, 2023 at 12:35:

> Can we just avoid converting to wide char at all and operate only in
> MBCS? IsDBCSLeadByte should be enough to allow these functions to skip
> any false matches on the second byte of double-byte chars. And it does
> not matter that IsDBCSLeadByte doesn't work with UTF-8, because the
> UTF-8 encoding already ensures that there will be no false matches with
> 7-bit ASCII chars (all bytes forming multi-byte chars have the MSB set,
> unlike some DBCS).
>
>
> On 23/3/2023 13:29, LIU Hao wrote:
> > On 2023-03-22 13:40, 傅继晗 wrote:
> >> Hello:
> >> There is no need to convert multi-byte characters to wide
> >> characters and then convert from wide back to multi-byte; just deal
> >> with multi-byte directly, as __xpg_basename in glibc does:
> >> glibc/xpg_basename.c at master · lattera/glibc (github.com)
> >> <https://github.com/lattera/glibc/blob/master/stdlib/xpg_basename.c>
> >> Converting multi-byte characters to wide characters can lead to
> >> garbled text if the incoming filename encoding does not
> >> match the system default encoding returned by GetACP(). And
> >> environment variables do not work there.
> >
> > I was also thinking about reimplementation of these functions, but
> > there are things that we must take care of:
> >
> > 1. The conversion from multibyte character strings (MBCS) to wide
> >    character strings (WCS) is necessary, because non-leading bytes
> >    in some MBCS encodings may match the backslash `\` (U+005C).
> >
> > 2. Not only forward and backward slashes are path separators. The
> >    conventional Shift JIS encoding has the Yen symbol `¥` (U+00A5)
> >    take the place of the ASCII backslash, and a Yen symbol is
> >    displayed for the byte 0x5C if a Shift JIS locale is activated.
> >    This means that `mbstowcs()` will convert the string "a¥b"
> >    (hex: 61 5C 62) to L"a¥b" (hex: 0061 00A5 0062) and we have to
> >    also accept `¥` as a path separator in Japanese locales.
> >
> > 3. Something similar to 2 happens with the Won symbol `₩` (U+20A9)
> >    in Korean locales, so we have to accept it, too.
> >
> > 4. Don't ever try to modify the global locale, due to thread safety.
> >
> >
> > OK, it has been too much. How do we implement this correctly? First we
> > need to make an assumption about the input, for example, let's assume
> > it is a path in Shift JIS encoding.
> >
> > 1. Don't check wide characters for path separators! As explained
> >    above, not only slashes are path separators. Japanese and Korean
> >    are what I happen to know, but there could be a couple more. We
> >    notice that the byte 5C is always a path separator, no matter
> >    what it may map to, `\`, `¥` or `₩`. Hence, only the original
> >    MBCS should be scanned for path separators, which are `/`
> >    (U+002F) and `\` (U+005C), but nothing else.
> >
> > 2. It's necessary to convert it to a WCS first. Ideally the caller
> >    should have set the global locale... but are they aware of it?
> >    Maybe this should be done via `MultiByteToWideChar()`, like
> >    other functions.
> >
> > 3. So what is the conversion for? We can't make use of the output
> >    of such a conversion, but it tells us how many bytes a character
> >    takes in the original MBCS, so we know how many bytes to move
> >    forward (a multi-byte character is never a path separator), and
> >    will not mistake a non-leading byte for a path separator.
> >
_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
