On Sat, Feb 24, 2007 at 02:57:44PM +0200, Pavel Tsekov wrote:

> I'd like to initiate a discussion on how to make MC unicode deal
> with multibyte character sets.
Hi,

Here are some of my thoughts:

- First of all, before doing any work, this is a must-read for everyone:
http://joelonsoftware.com/articles/Unicode.html
One of its main points: from the users' point of view it absolutely doesn't matter what bytes are there; the only thing that matters is that the users see every _letter_ correctly on the display. Byte sequences must always be converted accordingly. On the other hand, we'll see that it's often a must for mc to keep byte sequences unchanged. The other main point: for _all_ byte sequences -- inside mc, in the config and history files, in the vfs interface, everywhere -- you _must_ know which character set the string is in.

- Currently KDE has many more bugs with accented filenames than Gnome has. This is probably because they follow different philosophies. Gnome treats filenames as byte sequences (as every Unix does) and only converts them to characters for display purposes, while KDE treats them as character sequences (QString or something like that). Probably due to this, KDE has a lot of trouble: it is unable to correctly handle filenames that are invalid byte sequences according to the locale, and it often performs extra, erroneous conversions. So I think the right way is to internally _think_ in byte sequences, and only convert to/from characters when displaying them, doing regexp matches and so on.

- The same goes for file contents. Even in a UTF-8 environment people want to display (read) and edit files in different encodings, and even if every text file used UTF-8 there would still be other (non-text) files. We shouldn't drop support for editing binary files, hex editor mode and so on.

- When the author of the well-known text editor "joe" began to implement UTF-8 support, I helped him with advice and later with bug reports.
(He managed to implement a working version 2 weeks after he first heard of UTF-8 :-)) The result is IMHO a very well designed editor and I'd like to see something similar in mcview/mcedit. In order to help people migrate from 8-bit charsets to UTF-8, and in order to be able to view older files, it's important to support different file encodings and terminal charsets. For example, it should be possible to view a Latin-1 file inside a Latin-1 mc, a UTF-8 file in a Latin-1 mc (replacing non-representable characters with an inverted question mark or something like that), a Latin-1 file in a UTF-8 mc, and a UTF-8 file in a UTF-8 mc.

- The terminal charset should be taken from nl_langinfo(CODESET) (that is, from the LANG, LC_CTYPE and LC_ALL variables) and, as opposed to vim, I believe there should be _no_ way to override it in mc. No-one can expect correct behavior from any terminal application if these variables do not reflect the terminal's actual encoding, so it's the users' or software vendors' job to set them correctly; there is no reason why anyone would want to fix this in only one particular application. MC is not the place to fix it, and once it's fixed outside mc, mc should not provide an option to mess with it. (I have no experience with platforms that lack locale support; on such platforms a "terminal encoding" option might make sense, and the need for it could be detected by the ./configure script.)

- The file encoding should probably default to the terminal encoding, but should be easy to change in the viewer and editor. (In fact, some auto-detection might be added: e.g. if the file is not valid UTF-8, automatically fall back to the locale's legacy charset, or automatically assume UTF-8 if the file is valid. Joe has two boolean options that enable these two ways of auto-guessing the file encoding.)
This setting alters the way the file's content is interpreted (displayed on the screen, searched case-insensitively etc.) and how pressed keys are inserted into the file, but it does not alter the file itself (i.e. no iconv is performed on it). This way the editor remains completely binary-safe. Obviously, displaying the file requires conversion from the file encoding to the terminal encoding; interpreting pressed keys requires conversion in the reverse direction.

- Currently mc with the UTF-8 patches has a bug: when you run it in a UTF-8 environment and copy a file whose name is invalid UTF-8 (copy means F5 then Enter), the file name gets mangled: the invalid parts (bytes that are _shown_ as question marks) are replaced with literal question marks. Care should be taken to always _think_ in bytes and only convert to characters for display and similar purposes, so that byte sequences always remain the same.

- In UTF-8, the "size" (memory consumption), "length" (number of Unicode entities) and "width" (columns occupied in the terminal) are three different notions. The difference between the first two is trivial. The third differs because there are zero-width characters (e.g. combining accents, used e.g. in MacOS accented filenames) and double-width (CJK) characters too. I think handling them correctly is a must, and it should not be hard. I highly recommend using the often-misunderstood Hungarian Notation ( http://joelonsoftware.com/articles/Wrong.html -- read it!), so that for every function and variable that handles any of these three, the name reflects whether it stores a size, a length or a width. A lot of the current CJK-related bugs originate from not distinguishing between length and width. For example, there is a function called mbstrlen() that returns the width, not the length -- this _must_ be fixed ASAP.

- It's a good question whether to support more complicated languages, e.g. right-to-left scripts; I'm not aware of the technical issues that arise there.

- The vfs specification might need a major review. It should be decided and clearly documented what character set to use. I think there are two possible ways: always UTF-8, or the locale settings. In both cases, however, invalid byte sequences should be tolerated. The story gets a little more complicated because there are file system types where filenames are stored in one particular encoding. For example, Windows filesystems always use UTF-16. Suppose you use a Latin-1 locale, enter a Joliet (non-RockRidge) .iso image and copy files out of it. The filenames _must_ be converted, since UTF-16 is not usable on Unices. The user expects the software (mc+vfs) to convert them to Latin-1, since that is his locale and most likely all his other files have accents in this locale. But this conversion might fail due to unrepresentable characters. What to do in that case? Probably the best way is to imitate the kernel's behavior when you mount such an image with iocharset=iso-8859-1 and perform the same operation; it is one particular error code, I think. And how to list the contents of such a directory? Nice questions... Since it's quite unlikely that all the software the vfs plugins invoke can handle this situation in the same consistent way, my guess is that it's cleaner to force UTF-8 in the vfs communication, and let a Latin-1 mc handle the invalid entries it receives from the vfs plugin. (Just a side note: once the vfs interface is cleaned up, it's time to revisit other issues, e.g. 32-bit (64-bit?) UID/GID, >2GB files, nanosecond timestamp resolution etc. -- are all these supported?)

- Currently mc supports both the ncurses and slang backends via a common wrapper. Both libraries support Unicode, but in different ways: slang works with UTF-8 while ncurses works with wchar (practically UCS-4).
If only the lower-level ncurses routines are used, UTF-8 can be used too; I don't know whether this is the case in mc. Someone experienced with mc's internals should examine whether dropping support for one of these libraries would save noticeable developer resources. At this moment the resources to develop mc are IMHO much tighter than the resources at any site where either ncurses or slang has to be installed in order to install mc. So if keeping support for only one of these libraries would save work, I believe that is the way to go. Which library to support is a good question; a long time ago I wrote an e-mail here with my opinions on this.

--
Egmont

_______________________________________________
Mc-devel mailing list
http://mail.gnome.org/mailman/listinfo/mc-devel