Re: [Mingw-w64-public] [PATCH] rewrite the dirname.c and basename.c without wide character processing

Alvin Wong via Mingw-w64-public Fri, 24 Mar 2023 21:36:27 -0700

Can we just avoid converting to wide char at all and operate only in MBCS? IsDBCSLeadByte should be enough to allow these functions to skip any false matches on the second byte of double-byte chars. And it does not matter that IsDBCSLeadByte doesn't work with UTF-8, because the UTF-8 encoding already ensures that there will be no false matches with 7-bit ASCII chars (all bytes forming multi-byte chars have the MSB set, unlike some DBCS).
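
A minimal sketch of the UTF-8 half of this argument, in C. The helper name `utf8_last_sep` is hypothetical and not part of the patch; it only illustrates that a plain byte scan is safe for UTF-8, because every byte of a multi-byte sequence is 0x80 or above and cannot compare equal to `/` or `\`.

```c
#include <stddef.h>

/* Hypothetical helper (not from the patch): return a pointer to the last
 * '/' or '\\' in a UTF-8 path, or NULL if there is none. No decoding is
 * needed: every byte of a multi-byte UTF-8 sequence has its MSB set, so it
 * can never be mistaken for a 7-bit ASCII separator. */
static const char *utf8_last_sep(const char *path)
{
    const char *last = NULL;
    for (const char *p = path; *p != '\0'; ++p) {
        if (*p == '/' || *p == '\\')
            last = p;
    }
    return last;
}
```

For example, with the path `"dir/\xE8\xA1\xA8" ".txt"` (the character 表 in UTF-8 followed by `.txt`), the only match is the `/` at offset 3.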


On 23/3/2023 13:29, LIU Hao wrote:

On 2023-03-22 13:40, 傅继晗 wrote:
Hello:
There is no need to convert multi-byte characters to wide characters and then convert from wide characters back to multi-byte; just deal with multi-byte directly, as GNU's __xpg_basename does: glibc/xpg_basename.c at master · lattera/glibc (github.com) <https://github.com/lattera/glibc/blob/master/stdlib/xpg_basename.c>. Converting multi-byte characters to wide characters can produce garbled text if the encoding of the incoming filename does not match the system default encoding returned by GetACP(). And environment variables do not work there.
I was also thinking about reimplementation of these functions, but there are things that we must take care of:
1. The conversion from multibyte character strings (MBCS) to wide character strings (WCS) is necessary, because non-leading bytes in some MBCS encodings may match the backslash `\` (U+005C).
2. Not only forward and backward slashes are path separators. The conventional Shift JIS encoding has the Yen symbol `¥` (U+00A5) take the place of the ASCII backslash, and a Yen symbol is displayed for the byte 0x5C if a Shift JIS locale is activated. This means that `mbstowcs()` will convert the string "a¥b" (hex: 61 5C 62) to L"a¥b" (hex: 0061 00A5 0062), and we also have to accept `¥` as a path separator in Japanese locales.
3. Something similar to 2 happens with the Won symbol `₩` (U+20A9) in Korean locales, so we have to accept it, too.
4. Don't ever try to modify the global locale, due to thread safety.
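
To make point 1 concrete, here is a small C sketch of the false match. The names `sjis_name` and `naive_basename` are hypothetical and exist only for this illustration; the sketch assumes Shift JIS (code page 932) input, in which the character 表 is encoded as the two bytes 0x95 0x5C.

```c
#include <string.h>

/* "表.txt" in Shift JIS (assumed code page 932): 表 is 0x95 0x5C, so its
 * trail byte has the same value as the ASCII backslash. */
static const char sjis_name[] = "\x95\x5C" ".txt";

/* Hypothetical naive basename: everything after the last 0x5C byte,
 * with no awareness of double-byte characters. */
static const char *naive_basename(const char *path)
{
    const char *sep = strrchr(path, '\\');
    return sep ? sep + 1 : path;
}
```

Here `naive_basename(sjis_name)` returns `".txt"`: the scan stops on the trail byte of 表 and cuts the character in half, which is the failure any byte-level rewrite has to avoid.
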
OK, it has been too much. How do we implement this correctly? First we need to make an assumption about the input, for example, let's assume it is a path in Shift JIS encoding.
1. Don't check wide characters for path separators! As explained above, not only slashes are path separators. Japanese and Korean are what I happen to know, but there could be a couple more. We notice that the byte 5C is always a path separator, no matter what it may map to, `\`, `¥` or `₩`. Hence, only the original MBCS should be scanned for path separators, which are `/` (U+002F) and `\` (U+005C), but nothing else.
2. It's necessary to convert it to a WCS first. Ideally the caller should have set the global locale... but are they aware of it? Maybe this should be done via `MultiByteToWideChar()`, like other functions.
3. So what is the conversion for? We can't make use of the output of such conversion, but it gives information about how many bytes a character takes in the original MBCS, so we can know how many bytes to move forward (a multiple-byte character is never a path separator), and will not mistake a non-leading byte for a path separator.
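
The three steps above can be sketched in portable C as follows. This is only an illustration under stated assumptions: the input is taken to be Shift JIS (code page 932), and the hypothetical predicate `sjis_is_lead_byte` hard-codes that code page's lead-byte ranges as a stand-in for whatever supplies character lengths in a real implementation (a conversion call as described above, or a lead-byte query such as `IsDBCSLeadByteEx`). The name `sjis_last_sep` is likewise hypothetical.

```c
#include <stddef.h>

/* Assumption: Shift JIS (code page 932), whose lead bytes are
 * 0x81..0x9F and 0xE0..0xFC. Stand-in for a per-code-page query. */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: return a pointer to the last path separator byte
 * ('/' or 0x5C) in the original MBCS string, or NULL if there is none.
 * Only single-byte characters are tested; double-byte characters are
 * skipped as a unit, so a trail byte of 0x5C is never a match. */
static const char *sjis_last_sep(const char *path)
{
    const char *last = NULL;
    const char *p = path;

    while (*p != '\0') {
        if (sjis_is_lead_byte((unsigned char)*p) && p[1] != '\0') {
            p += 2;              /* skip lead byte and trail byte together */
            continue;
        }
        if (*p == '/' || *p == '\\')
            last = p;            /* 0x5C counts whether shown as \, ¥ or ₩ */
        ++p;
    }
    return last;
}
```

With the input `"a\\" "\x95\x5C" ".txt"` (a real separator followed by 表 and `.txt`), the function returns the separator at offset 1 and ignores the 0x5C trail byte at offset 3. The scan never looks at converted wide characters, which matches step 1.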





_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public


