Re: [Mingw-w64-public] [PATCH] rewrite the dirname.c and basename.c without wide character processing

傅继晗 Wed, 22 Mar 2023 23:42:59 -0700

Thanks for your attention! I really understand it this time. But I still have
a question: why doesn't dirname in glibc on Linux process MBCS? Maybe they
assume the input is UTF-8 by default?
LIU Hao <[email protected]> wrote on Thu, 23 Mar 2023 at 13:30:


> On 2023-03-22 13:40, 傅继晗 wrote:
> > Hello:
> > There is no need to convert multi-byte characters to wide characters and
> > then convert from wide characters back to multi-byte; just deal with the
> > multi-byte string directly, as __xpg_basename in glibc does:
> > glibc/xpg_basename.c at master · lattera/glibc (github.com)
> > <https://github.com/lattera/glibc/blob/master/stdlib/xpg_basename.c>
> > Converting multi-byte characters to wide characters can lead to garbled
> > text if the incoming filename encoding does not match the system default
> > encoding returned by GetACP(). And environment variables do not work
> > there.
>
> I was also thinking about reimplementing these functions, but there are
> things that we must take care of:
>
> 1. The conversion from multibyte character strings (MBCS) to wide character
>     strings (WCS) is necessary, because non-leading bytes in some MBCS
>     encodings may match the backslash `\` (U+005C).
>
> 2. Forward and backward slashes are not the only path separators. The
>     conventional Shift JIS encoding has the Yen symbol `¥` (U+00A5) take the
>     place of the ASCII backslash, and a Yen symbol is displayed for the byte
>     0x5C if a Shift JIS locale is activated. This means that `mbstowcs()`
>     will convert the string "a¥b" (hex: 61 5C 62) to L"a¥b" (hex: 0061 00A5
>     0062), and we also have to accept `¥` as a path separator in Japanese
>     locales.
>
> 3. Something similar to 2 happens with the Won symbol `₩` (U+20A9) in Korean
>     locales, so we have to accept it, too.
>
> 4. Don't ever try to modify the global locale, due to thread safety.
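
To make point 1 concrete, here is a minimal sketch in portable C. It assumes the usual Shift JIS lead-byte ranges (0x81-0x9F and 0xE0-0xFC) and uses the katakana `ソ`, which is encoded as 83 5C in Shift JIS. The helper names are hypothetical and are not part of mingw-w64 or glibc.

```c
/* Assumed Shift JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC. */
static int sjis_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive scan: index of the last byte equal to '/' or '\\', or -1 if none.
 * It looks at bytes only and knows nothing about character boundaries. */
static int naive_last_sep(const char *path)
{
    int last = -1;
    for (int i = 0; path[i] != '\0'; i++)
        if (path[i] == '/' || path[i] == '\\')
            last = i;
    return last;
}

/* Returns 1 if path[index] is the second byte of a double-byte character,
 * found by walking the string from its start, and 0 otherwise. */
static int sjis_is_trail_at(const char *path, int index)
{
    int i = 0;
    while (path[i] != '\0' && i < index) {
        if (sjis_lead_byte((unsigned char)path[i]) && path[i + 1] != '\0') {
            if (i + 1 == index)
                return 1;
            i += 2;   /* step over the whole double-byte character */
        } else {
            i += 1;   /* single-byte character */
        }
    }
    return 0;
}
```

For the bytes 61 5C 83 5C 62 (the path `a\ソb`), `naive_last_sep()` reports index 3, and `sjis_is_trail_at()` confirms that index 3 is the trail byte of `ソ`, not a separator. The only real separator is at index 1.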
>
>
> OK, that has been a lot. How do we implement this correctly? First we need
> to make an assumption about the input; for example, let's assume it is a
> path in Shift JIS encoding.
>
> 1. Don't check wide characters for path separators! As explained above,
>     slashes are not the only path separators. Japanese and Korean are what I
>     happen to know, but there could be a couple more. We notice that the
>     byte 5C (as a single-byte character) is always a path separator, no
>     matter what it may map to: `\`, `¥` or `₩`. Hence, only the original
>     MBCS should be scanned for path separators, which are `/` (U+002F) and
>     `\` (U+005C), but nothing else.
>
> 2. It's necessary to convert it to a WCS first. Ideally the caller should
>     have set the global locale... but are they aware of it? Maybe this
>     should be done via `MultiByteToWideChar()`, like other functions.
>
> 3. So what is the conversion for? We can't make use of the output of such a
>     conversion, but it tells us how many bytes a character takes in the
>     original MBCS, so we know how many bytes to move forward (a multi-byte
>     character is never a path separator) and will not mistake a non-leading
>     byte for a path separator.
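
The three steps above could be sketched roughly as follows. This is not the mingw-w64 implementation: instead of a real conversion it takes a hypothetical lead-byte predicate that tells the scanner whether a character occupies one or two bytes. The Shift JIS predicate here is a hard-coded stand-in so the example stays portable; on Windows, a small wrapper around `IsDBCSLeadByteEx()` for the chosen code page might play that role, but that is an assumption.

```c
#include <stddef.h>

/* Hypothetical callback type: nonzero if the byte starts a two-byte character. */
typedef int (*lead_byte_fn)(unsigned char);

/* Stand-in predicate with the assumed Shift JIS lead-byte ranges. */
static int demo_sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Scans the original multibyte string character by character and returns a
 * pointer to the last single-byte '/' or '\\', or NULL if there is none.
 * Double-byte characters are skipped whole, so a trail byte 0x5C is never
 * taken for a separator. The global locale is not touched. */
static const char *last_separator(const char *path, lead_byte_fn is_lead)
{
    const char *last = NULL;
    const char *p = path;

    while (*p != '\0') {
        if (is_lead((unsigned char)*p) && p[1] != '\0') {
            p += 2;          /* multi-byte character: never a separator */
            continue;
        }
        if (*p == '/' || *p == '\\')
            last = p;        /* byte 2F or 5C as a single-byte character */
        p++;
    }
    return last;
}
```

With the bytes 83 5C 2F 95 5C (`ソ/表` in Shift JIS), `last_separator()` returns a pointer to offset 2, the `/`. For 95 5C alone it returns NULL, because the 5C there is only a trail byte.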
>
>
>
> --
> Best regards,
> LIU Hao
>
>

_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
