Can we just avoid converting to wide char at all and operate only in
MBCS? IsDBCSLeadByte should be enough to allow these functions to skip
any false matches on the second byte of double-byte chars. And it does
not matter that IsDBCSLeadByte doesn't work with UTF-8, because the
UTF-8 encoding already ensures that there will be no false matches with
7-bit ASCII chars (all bytes forming multi-byte chars have the MSB set,
unlike the trail bytes of some DBCS encodings).
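
A minimal sketch of that idea, written portably so it can be tried outside
Windows. The names `last_separator` and `is_lead_byte_cp932` are hypothetical
and do not come from mingw-w64; the predicate hard-codes the usual Shift JIS
(CP932) lead-byte ranges as a stand-in for what IsDBCSLeadByte would report
on a system whose ANSI code page is 932. Passing NULL as the predicate gives
the plain byte scan, which is enough for UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByte on a CP932 system:
   the lead-byte ranges of Shift JIS. Real code would call the Windows API. */
static int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Returns a pointer to the last path separator in `path`, or NULL if there
   is none. The byte after a lead byte is skipped, so a 0x5C trail byte is
   never reported as a separator. `is_lead` may be NULL (no DBCS handling). */
static const char *last_separator(const char *path,
                                  int (*is_lead)(unsigned char))
{
    const char *last = NULL;
    assert(path != NULL);
    for (const char *p = path; *p != '\0'; ++p) {
        unsigned char b = (unsigned char)*p;
        if (is_lead != NULL && is_lead(b)) {
            if (p[1] == '\0')
                break;      /* truncated double-byte character at the end */
            ++p;            /* skip the trail byte */
            continue;
        }
        if (b == '/' || b == '\\')
            last = p;
    }
    return last;
}
```

For the Shift JIS bytes 61 5C 95 5C 62 this reports the separator at offset 1,
while the same scan without the predicate reports offset 3, which is the
second byte of a double-byte character.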
On 23/3/2023 13:29, LIU Hao wrote:
On 2023-03-22 13:40, 傅继晗 wrote:
Hello,
There is no need to convert multi-byte characters to wide characters
and then convert them back to multi-byte. The multi-byte string can be
handled directly, as __xpg_basename does in glibc:
glibc/xpg_basename.c at master · lattera/glibc (github.com)
<https://github.com/lattera/glibc/blob/master/stdlib/xpg_basename.c>
Converting multi-byte characters to wide characters can produce
garbled text if the encoding of the incoming filename does not
match the system default code page returned by GetACP(). And
environment variables do not work there.
I was also thinking about reimplementation of these functions, but
there are things that we must take care of:
1. The conversion from multibyte character strings (MBCS) to wide character
   strings (WCS) is necessary, because non-leading bytes in some MBCS
   encodings may match the backslash `\` (U+005C).
2. Forward and backward slashes are not the only path separators. The
   conventional Shift JIS encoding has the Yen symbol `¥` (U+00A5) take the
   place of the ASCII backslash, and a Yen symbol is displayed for the byte
   0x5C if a Shift JIS locale is activated. This means that `mbstowcs()` will
   convert the string "a¥b" (hex: 61 5C 62) to L"a¥b" (hex: 0061 00A5 0062),
   and we also have to accept `¥` as a path separator in Japanese locales.
3. Something similar to 2 happens with the Won symbol `₩` (U+20A9) in Korean
   locales, so we have to accept it, too.
4. Don't ever try to modify the global locale, due to thread safety.
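
To make point 1 concrete, here is a small illustration of the false match.
It relies on one known fact: the character `表` is encoded in Shift JIS as
the byte pair 95 5C. The helper `naive_basename_offset` and the sample name
are made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Shift JIS bytes for "dir\表.txt"; the second byte of "表" (95 5C) has the
   same value as the ASCII backslash. */
static const char sjis_name[] = {
    'd', 'i', 'r', '\\', (char)0x95, 0x5C, '.', 't', 'x', 't', '\0'
};

/* Naive lookup that treats every 0x5C byte as a separator. Returns the
   offset of the byte following the last 0x5C, or 0 if there is none. */
static size_t naive_basename_offset(const char *path)
{
    const char *sep;
    assert(path != NULL);
    sep = strrchr(path, '\\');
    return sep != NULL ? (size_t)(sep - path) + 1 : 0;
}
```

`naive_basename_offset(sjis_name)` yields 6, so the result starts at ".txt"
and the file name is cut in the middle of `表`. The real separator is the
byte at offset 3, and the last component starts at offset 4.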
OK, it has been too much. How do we implement this correctly? First we
need to make an assumption about the input, for example, let's assume
it is a path in Shift JIS encoding.
1. Don't check wide characters for path separators! As explained above,
   slashes are not the only path separators. Japanese and Korean are what I
   happen to know, but there could be a couple more. We notice that the byte
   5C is always a path separator, no matter what it may map to: `\`, `¥` or
   `₩`. Hence, only the original MBCS should be scanned for path separators,
   which are `/` (U+002F) and `\` (U+005C), but nothing else.
2. It's necessary to convert it to a WCS first. Ideally the caller should
   have set the global locale... but are they aware of it? Maybe this should
   be done via `MultiByteToWideChar()`, like other functions.
3. So what is the conversion for? We can't make use of the output of such a
   conversion, but it gives information about how many bytes a character
   takes in the original MBCS, so we know how many bytes to move forward (a
   multi-byte character is never a path separator), and will not mistake a
   non-leading byte for a path separator.
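
The three steps could be sketched roughly as follows, under the Shift JIS
assumption made above. The character length is supplied by a callback so the
scan itself never looks at wide characters. Both `mbcs_basename` and
`char_len_cp932` are hypothetical names; the callback here hard-codes the
Shift JIS lead-byte ranges, whereas on Windows the length would be derived
from the conversion for the relevant code page. Trailing separators and drive
letters, which a real basename() has to handle, are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical length callback: number of bytes taken by the character that
   starts at `s` (always at least 1). Stand-in for CP932 only. */
static size_t char_len_cp932(const char *s)
{
    unsigned char b = (unsigned char)s[0];
    int lead = (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    return (lead && s[1] != '\0') ? 2 : 1;
}

/* Returns a pointer to the last path component of `path`. Only single-byte
   characters are compared against '/' and '\\'; multi-byte characters are
   stepped over as a whole, so their trail bytes are never examined. */
static const char *mbcs_basename(const char *path,
                                 size_t (*char_len)(const char *))
{
    const char *base = path;
    assert(path != NULL && char_len != NULL);
    for (const char *p = path; *p != '\0'; ) {
        size_t n = char_len(p);
        if (n == 1 && (*p == '/' || *p == '\\'))
            base = p + 1;   /* component starts after the separator */
        p += n;
    }
    return base;
}
```

For the bytes 64 69 72 5C 95 5C 2E 74 78 74 ("dir\表.txt" in Shift JIS) the
returned pointer is at offset 4, the start of `表.txt`.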
_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public