Re: [Mingw-w64-public] [PATCH] rewrite the dirname.c and basename.c without wide character processing

LIU Hao Wed, 22 Mar 2023 22:29:55 -0700

On 2023-03-22 13:40, 傅继晗 wrote:

Hello:
There is no need to convert multi-byte characters to wide characters and then back to multi-byte; just handle the multi-byte string directly, as __xpg_basename does in glibc: glibc/xpg_basename.c at master · lattera/glibc (github.com) <https://github.com/lattera/glibc/blob/master/stdlib/xpg_basename.c>. Converting multi-byte characters to wide characters can produce garbled output if the encoding of the incoming filename does not match the system default code page returned by GetACP(). Environment variables do not work there, either.

I was also thinking about reimplementation of these functions, but there are things that we must take care of:


1. The conversion from multibyte character strings (MBCS) to wide character
   strings (WCS) is necessary, because non-leading bytes in some MBCS encodings
   may match the backslash `\` (U+005C).

2. Forward and backward slashes are not the only path separators. The
   conventional Shift JIS encoding has the Yen symbol `¥` (U+00A5) take the
   place of the ASCII backslash, and a Yen symbol is displayed for the byte 0x5C
   if a Shift JIS locale is activated. This means that `mbstowcs()` will convert
   the string "a¥b" (hex: 61 5C 62) to L"a¥b" (hex: 0061 00A5 0062) and we have
   to also accept `¥` as a path separator in Japanese locales.

3. Something similar to 2 happens with the Won symbol `₩` (U+20A9) in Korean
   locales, so we have to accept it, too.

4. Don't ever try to modify the global locale, due to thread safety.
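
To illustrate point 1, here is a minimal sketch of what goes wrong when the bytes are scanned without any notion of character boundaries. The helper name `naive_last_backslash` and the sample string are made up for this example; the only fact relied on is that the katakana `ソ` is encoded as the two bytes 83 5C in Shift JIS.

```c
#include <string.h>

/* Hypothetical helper: offset of the last 0x5C byte in `path`, or -1 if
 * there is none. It ignores character boundaries on purpose. */
static long naive_last_backslash(const char *path)
{
    const char *p = strrchr(path, 0x5C);
    return p ? (long)(p - path) : -1;
}

/* "ソ.txt" in Shift JIS: the character 'ソ' is the byte pair 83 5C, so the
 * second byte of the character looks exactly like a backslash. */
static const char sjis_so_txt[] = "\x83\x5C.txt";
```

For `sjis_so_txt` the naive scan reports a separator at offset 1, so a basename built on it would return ".txt" and cut the character in half.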

OK, that is a lot to take care of. How do we implement this correctly? First we need to make an assumption about the input; for example, let's assume it is a path in Shift JIS encoding.


1. Don't check wide characters for path separators! As explained above, slashes
   are not the only path separators. Japanese and Korean are what I happen to
   know, but there could be a couple more. We notice that the byte 0x5C is
   always a path separator, no matter what it may map to, `\`, `¥` or `₩`.
   Hence, only the original MBCS should be scanned for path separators, which
   are `/` (U+002F) and `\` (U+005C), but nothing else.

2. It's necessary to convert it to a WCS first. Ideally the caller should have
   set the global locale... but are they aware of it? Maybe this should be done
   via `MultiByteToWideChar()`, like other functions.

3. So what is the conversion for? We can't make use of its output, but it tells
   us how many bytes each character takes in the original MBCS, so we know how
   many bytes to move forward (a multi-byte character is never a path
   separator) and will not mistake a non-leading byte for a path separator.
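
The scan described in these three steps could look roughly like the sketch below. It is not the proposed patch: instead of calling `MultiByteToWideChar()` to learn character lengths, it substitutes a hard-coded table of Shift JIS (code page 932) lead-byte ranges so that it stays portable. On Windows, `IsDBCSLeadByteEx()` with the desired code page would be the natural replacement for the hypothetical `is_sjis_lead_byte()`. The function name `sjis_basename_offset` is also invented here, and handling of trailing separators and drive letters is left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByteEx(932, b): lead-byte ranges of
 * Shift JIS as used by code page 932. */
static int is_sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Returns the offset just past the last path separator in `path`, or 0 if
 * there is none. Only the original bytes are inspected; 0x2F and 0x5C count
 * as separators, and only when they are not part of a double-byte character. */
static size_t sjis_basename_offset(const char *path)
{
    size_t base = 0;
    size_t i = 0;

    while (path[i] != '\0') {
        unsigned char b = (unsigned char)path[i];

        if (is_sjis_lead_byte(b) && path[i + 1] != '\0') {
            i += 2;            /* double-byte character: never a separator */
            continue;
        }
        if (b == 0x2F || b == 0x5C)
            base = i + 1;      /* remember where the next component starts */
        i += 1;
    }
    return base;
}
```

With this scan, "\x83\x5C.txt" (`ソ.txt`) yields offset 0, so the whole name is kept, while "a\\b" yields offset 2 regardless of whether the byte 5C is displayed as `\`, `¥` or `₩`.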



--
Best regards,
LIU Hao


_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
