On Fri, May 26, 2006 at 10:40:26AM -0500, Jeremy Nelson wrote:
> On Fri, May 26, 2006 at 10:34:47AM -0400, Brian Bruns wrote:
> > On Friday, May 26, 2006 10:14 AM [EST], Jeremy Nelson wrote:
> >
> > > and what support for utf-8 would mean for epic5? I understand all of
> > > the principles, but what I don't understand is what things are broken
> > > in epic5 because of the lack of support, and then what things are at
> > > my disposal to "fix" these broken things.
> > >
> > > I get a lot of requests to support utf-8, but nobody seems to know
> > > what is required, and neither do I. So this call for assistance.
> >
> > UTF-8 display should be supported by making sure the right LANG and TERM
> > are set. Input is another issue. EPIC uses ncurses right?
>
> No -- epic does everything right off the pty using open() and read() and
> select() and so forth. You may ask about epic linking against ncurses,
> but that is because ncurses comes with a terminfo implementation which
> epic prefers to termcap.
>
> One of the things people have mentioned with utf8 support is column counting.
> It's apparantly no longer valid to assume every byte takes up a column:
> sometimes multiple bytes make up one column, and sometimes one byte takes
> up multiple columns. How have others solved this?
>
> Then there's issues like how should epic internally store strings, should
> we use utf16 (wchar_t) from c99? And what if someone is not using a utf-8
> terminal emulator, how do i support those?

I think internally you want to use wchar_t, which is not UTF-16.  It's
defined as an int for me, so 32 bits, but it's not guaranteed to be
(some variant of) UTF-32 either.  wchar_t is just some other
representation, which might even depend on the current locale.

There are C99 and POSIX functions for most of the things you want to
do.  For instance, wcwidth() (from POSIX) will tell you how many columns
a character takes.  Afaik, for every string function there is an
equivalent wide string function.  So if you can convert things to a wide
string (wchar_t), it should be "easy" to do (basic) Unicode support.

The only problem you have left is the conversion from one charset to
another.  POSIX defines iconv(), which allows you to convert from one
charset to another.  This should be available on most OSs afaik.  There
is an (LGPL) library (libiconv) that you can use if your OS doesn't
support it, and afaik there are even some alternatives to it.

Anyway, let me try to explain everything that needs to happen.

You have the communication with the IRC server.  In ideal circumstances
this should always be in UTF-8.  But it would be nice if you could fall
back to the current codeset (see later) when it isn't.  If you try to
decode incoming text as UTF-8 and it works, it most likely is UTF-8.
If it fails, you fall back.  Some users might also want the option to
not send in UTF-8, but I don't think we should keep supporting that.
Even mIRC now supports UTF-8 and should be able to deal with it.

Then you have the communication with the terminal.  This should happen
in the current codeset.  It's determined by the LC_CTYPE locale
category, and you can get it with nl_langinfo(CODESET).  It's also what
the command "locale charmap" returns.

There are several ways to deal with it.  One of them is using iconv() to
convert to the terminal codeset and just dumping that to the screen.
But then you can't use the w* functions, which you might want to use.
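
The two decisions described above (is the text from the server UTF-8,
and which codeset does the terminal use) could be sketched in C roughly
as follows.  This is only a minimal sketch under assumptions, not epic's
code: is_valid_utf8(), terminal_codeset() and guess_incoming_charset()
are hypothetical helper names, and the check is a hand-written
well-formedness test rather than anything a library provides.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the len bytes at s are well-formed
 * UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being assembled */
        size_t k;

        if (c < 0x80) {     /* plain ASCII byte */
            i++;
            continue;
        }
        if (c >= 0xC2 && c <= 0xDF) {
            need = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F;
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3; cp = c & 0x07;
        } else {
            return 0;       /* stray continuation or invalid lead byte */
        }
        if (len - i <= need)
            return 0;       /* sequence is cut off */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if ((need == 2 && cp < 0x800) || (need == 3 && cp < 0x10000))
            return 0;
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;
        i += need + 1;
    }
    return 1;
}

/* Hypothetical helper: name of the codeset selected by LC_CTYPE, the
 * same string that "locale charmap" prints. */
static const char *terminal_codeset(void)
{
    setlocale(LC_CTYPE, "");    /* pick up LANG / LC_* from the environment */
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: treat server text as UTF-8 if it decodes
 * cleanly, otherwise fall back to the current codeset. */
static const char *guess_incoming_charset(const unsigned char *s, size_t len)
{
    return is_valid_utf8(s, len) ? "UTF-8" : terminal_codeset();
}
```

For example, the bytes "caf\xC3\xA9" pass the check and would be treated
as UTF-8, while the Latin-1 bytes "caf\xE9" fail it and would fall back
to whatever nl_langinfo(CODESET) reports.  The returned name is the kind
of string you would hand to iconv_open() as the source charset.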

Instead of dumping the iconv() output directly, you can use something
like mbtowc() to convert from the current codeset to wchar_t, and then
use the w* functions to do your thing, which will in turn do the
conversion from wchar_t back to the terminal codeset.  Converting from
the codeset to wchar_t and from wchar_t back to the codeset might look
like stupid work, but I think you really would want to use those w*
functions.

Anyway, if you have more questions, feel free to ask.


Kurt
