From: "dean gaudet" <[EMAIL PROTECTED]>
Sent: Sunday, February 25, 2001 7:42 PM
> i'm a bit of an I18N novice, but doesn't it all just magically work if you
> use UTF-8 encoding everywhere?
>
> UTF-8 deliberately avoids using \0 and / in the encodings. plain ascii
> works unmodified. unix filesystems generally support UTF-8 directly
> (because of the \0 and / avoidance).
>
> this allows you to have a single API which understands unicode on all
> platforms -- you don't need to have _u versions which take unicode
> strings.
You understand exactly what I proposed with APR_HAS_UNICODE_FS.
My only small change is a way to get config directives in with wchar
support. Since Win32 has no utf-8 editor, I'm working out a patch
to recognize the lead word (the byte order mark) of a unicode stream
and convert the file from unicode to utf-8. Even notepad on Win32
supports unicode files, so this becomes a no-brainer for administrators.
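
To make the lead-word idea concrete, here is a minimal sketch in Python (not the actual patch, which is still being worked out). The function name config_bytes_to_utf8 is hypothetical, and the sketch assumes the only unicode streams of interest are UTF-16 files that start with a byte order mark; anything else is passed through untouched.

```python
import codecs

def config_bytes_to_utf8(raw: bytes) -> bytes:
    """Return config file contents as utf-8 (hypothetical helper).

    Assumption: a unicode file is recognized only by a leading UTF-16
    byte order mark. Files without one are assumed to be ascii/utf-8
    already and are returned unchanged.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):      # FF FE: little-endian
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
        return text.encode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_BE):      # FE FF: big-endian
        text = raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
        return text.encode("utf-8")
    return raw                                    # no lead word: leave as-is
```

A plain ascii config goes through unchanged, so existing files keep working; only a file saved as "Unicode" from an editor like notepad takes the conversion path.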
> give this page a perusal: http://www.cl.cam.ac.uk/~mgk25/unicode.html
I especially liked a comment from http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux
  External file system drivers such as VFAT and WinNT have to convert
  file name character encodings. UTF-8 has to be added to the list of
  already available conversion options, and the mount command has to
  tell the kernel driver that user processes shall see UTF-8 file
  names. Since VFAT and WinNT use already Unicode anyway, UTF-8 has
  the advantage of guaranteeing a lossless conversion here.
My key concept is _lossless_. All SomeWin32FunctionA() variants are lossy, and
their encoding doesn't correspond to MS's own clib [we can comment on their lack
of brain cells here ... but we won't.] All SomeWin32FunctionW() variants are
not only lossless, but faster. Obviously we replace their conversion cycles
from local code page to unicode with our own utf-8 to unicode functions, but that
shouldn't (if I succeeded) add any net CPU cycles.
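
A small Python illustration of what lossless means here. This is only a sketch of the encodings involved, not of the Win32 calls themselves: the two function names are hypothetical, cp1252 stands in for "the local code page", and replacing unmappable characters with '?' is an assumption about how a lossy conversion typically degrades.

```python
def roundtrip_utf8_utf16(name: str) -> str:
    """utf-8 -> unicode (UTF-16) -> utf-8, as a wide-character path would do."""
    utf8 = name.encode("utf-8")
    wide = utf8.decode("utf-8").encode("utf-16-le")   # what a W call would see
    return wide.decode("utf-16-le").encode("utf-8").decode("utf-8")

def roundtrip_codepage(name: str, codepage: str = "cp1252") -> str:
    """unicode -> narrow code page -> unicode, as a narrow path would do.

    Assumption: characters missing from the code page become '?'.
    """
    narrow = name.encode(codepage, errors="replace")
    return narrow.decode(codepage)

name = "\u65e5\u672c\u8a9e.txt"           # a file name with three CJK characters
print(roundtrip_utf8_utf16(name) == name)  # the utf-8/unicode path preserves it
print(roundtrip_codepage(name))            # the code page path cannot
```

The first path returns the name it was given for any input; the second only does so when every character happens to exist in the chosen code page.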
Of course utf-8 strings don't correspond to what the clib functions
expect [e.g. - consider strlen(), which counts bytes rather than
characters], but we are damned if we do... damned if we don't.
mod_autoindex obviously needs to see APR_HAS_UNICODE_FS and adjust the
width accordingly. We will get there, but we aren't there yet.
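
To show the width problem in miniature, here is a hedged Python sketch (not mod_autoindex code). Both helper names are hypothetical. It counts utf-8 characters by skipping continuation bytes, and it assumes one code point occupies one column, which is not true for combining marks or wide CJK glyphs.

```python
def utf8_char_count(name_utf8: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in name_utf8 if (b & 0xC0) != 0x80)

def pad_to_columns(name_utf8: bytes, columns: int) -> bytes:
    """Pad a utf-8 name with spaces based on characters, not bytes.

    Assumption: one code point == one display column.
    """
    missing = max(0, columns - utf8_char_count(name_utf8))
    return name_utf8 + b" " * missing

name = "caf\u00e9.html".encode("utf-8")
print(len(name))              # byte count, the figure strlen() would report
print(utf8_char_count(name))  # character count, the figure a listing needs
```

Padding by the byte count would leave this name one column short of its neighbours in a directory listing, which is the adjustment the listing code has to make.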
If we support the native narrow characters we need an effective API to do so
[should we use the current ansi code page or the current oem code page?]. We didn't
have a respectable design, and this change made all those other issues moot.