> From: Mark H Weaver <m...@netris.org>
> Cc: l...@gnu.org, guile-devel@gnu.org
> Date: Mon, 24 Feb 2014 13:33:56 -0500
>
> >> For example, if it is true that we can avoid all the nasty problems with
> >> filename encoding by using the native Windows APIs that use UTF-16 for
> >> filenames, then I'd be in favor of doing that.
> >
> > What nasty problems do you have in mind? I implemented something like
> > this in Emacs not long ago, so perhaps I can help here.
>
> The nasty problem is that POSIX uses sequences of bytes for filenames,
> although conceptually filenames are character strings, and in fact
> virtually all user interfaces treat filenames as character strings.
>
> Guile uses character strings for filenames (the only sane thing to do),
> and it would be good to build Guile on system APIs that also use
> character strings for filenames, instead of having to guess how to
> encode the characters into bytes.
>
> We don't have a fully satisfactory solution to this problem on POSIX,
> but I guess we do on Windows, if we use the native Windows APIs.
>
> BTW, the same problems exist for command-line arguments, environment
> variables, the hostname, etc. All of these are sequences of bytes in
> POSIX, but conceptually they should be character strings.
>
> If you'd like to work on a patch to have Guile use the native Windows
> APIs (that use UTF-16) for these things, I think that would be very
> useful and worthy of inclusion.

This issue needs to be carefully designed first.

File names are easy, as far as Guile and the OS are concerned.
Environment variables and command-line arguments likewise. But once
you need to display those file names or variables, or ask the user to
type them, there are problems that don't have good solutions yet, at
least not in Guile applications that use the text terminal for
display. First, you need to bypass the usual stdio output routines and
use special APIs. And after you've done that, you'll bump into the
fact that Windows console devices are limited in their ability to
support Unicode characters outside of the system locale; basically,
anything beyond European scripts is not supported. (Emacs avoids this
problem because its usual UI is a graphical one, where fonts and
layout engines are available that support almost any script in
existence.)

Likewise for keyboard input: typing non-ASCII text that is outside of
the current console codepage into the Windows console is a tricky
business; basically, you need to completely bypass the "normal" stdio
functions and use Windows-specific console APIs and Windows input
methods.

There's also the issue of invoking other programs with arguments that
include Unicode characters. Most programs that Guile will invoke on
Windows do not support that: they are "normal" console programs that
only support characters encoded in the current console codepage.
Windows will transparently convert from Unicode to the codepage
encoding, but if there are characters outside of that codepage, they
will be omitted or replaced by placeholder characters, which might
cause strange failures.

There are also complications when calling functions from external
libraries that accept file names: those libraries will not normally
support Unicode characters in file names. But this problem can be
solved by a known trick of using the 8.3 short aliases of the file
names, which use only ASCII characters.

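To make the codepage problem above concrete, here is a minimal sketch
in Python (not Guile code, and not the Windows API itself). The helper
names to_codepage and survives_codepage are hypothetical, cp1252 is
just an example codepage, and Python's errors="replace" only
approximates what Windows does when it converts Unicode arguments for
a codepage-only program:

```python
# Sketch only: emulates lossy Unicode -> codepage conversion.
# to_codepage and survives_codepage are hypothetical helper names;
# "cp1252" is an assumed example codepage, not necessarily the
# console codepage of any given system.

def to_codepage(name: str, codepage: str = "cp1252") -> bytes:
    """Encode NAME, replacing unencodable characters with '?'."""
    return name.encode(codepage, errors="replace")


def survives_codepage(name: str, codepage: str = "cp1252") -> bool:
    """Return True if NAME can be encoded in CODEPAGE without loss."""
    try:
        name.encode(codepage)  # strict mode raises on unencodable chars
    except UnicodeEncodeError:
        return False
    return True


# A Western European name fits in cp1252 ...
latin_name = "caf\u00e9.txt"
# ... but a Hebrew name does not; each Hebrew letter is replaced.
hebrew_name = "\u05e9\u05dc\u05d5\u05dd.txt"

print(survives_codepage(latin_name), to_codepage(latin_name))
print(survives_codepage(hebrew_name), to_codepage(hebrew_name))
```

A check like survives_codepage is the kind of test Guile could apply
before passing an argument to a codepage-only program, so that it can
signal an error instead of silently handing over a mangled name.
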
So to provide something useful in this department, we need to discuss what portions of Guile it is sensible and practical to convert to Unicode, and how to treat those areas where we won't. I will certainly need some insider's help in this.