On Fri, May 26, 2006 at 10:40:26AM -0500, Jeremy Nelson wrote:
> On Fri, May 26, 2006 at 10:34:47AM -0400, Brian Bruns wrote:
> > On Friday, May 26, 2006 10:14 AM [EST], Jeremy Nelson wrote:
> > 
> > > and what support for utf-8 would mean for epic5?  I understand all of
> > > the principles, but what I don't understand is what things are broken
> > > in epic5 because of the lack of support, and then what things are at
> > > my disposal to "fix" these broken things.
> > >
> > > I get a lot of requests to support utf-8, but nobody seems to know
> > > what is required, and neither do I.  So this call for assistance.
> > 
> > UTF-8 display should be supported by making sure the right LANG and TERM 
> > are set.  Input is another issue.  EPIC uses ncurses right?
> 
> No -- epic does everything right off the pty using open() and read() and 
> select() and so forth.  You may ask about epic linking against ncurses,
> but that is because ncurses comes with a terminfo implementation which 
> epic prefers to termcap.
> 
> One of the things people have mentioned with utf8 support is column counting.
> It's apparently no longer valid to assume every byte takes up a column:
> sometimes multiple bytes make up one column, and sometimes one byte takes
> up multiple columns.  How have others solved this?
> 
> Then there's issues like how should epic internally store strings, should
> we use utf16 (wchar_t) from c99?  And what if someone is not using a utf-8
> terminal emulator, how do i support those?

I think internally you want to use wchar_t, which is not utf-16.
It's defined as an int for me, so 32 bits, but it's not necessarily
the same as (some variant of) utf-32.  wchar_t is just some other
representation, which might even depend on the current locale.

There are C99 and POSIX functions for most of the things you want
to do.  For instance, wcwidth() (which comes from POSIX rather than
C99 itself) will tell you how many columns a character will take.
Afaik, for every string function there is an equivalent wide
string function.  So, if you can convert things to a wide string
(wchar_t), it should be "easy" to do (basic) unicode support.
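
As a minimal sketch of the column counting, assuming the string has
already been converted to wchar_t: display_columns() is a hypothetical
helper name, not something epic has.  The _XOPEN_SOURCE define is an
assumption about glibc, which only declares wcwidth() when it is set.

```c
#define _XOPEN_SOURCE 700   /* assumption: needed on glibc to declare wcwidth() */
#include <wchar.h>

/* Hypothetical helper: number of terminal columns the wide string s
 * occupies, or -1 if it contains a character wcwidth() considers
 * non-printable in the current locale. */
int display_columns(const wchar_t *s)
{
    int total = 0;

    for (; *s != L'\0'; s++) {
        int w = wcwidth(*s);   /* 0, 1 or 2 columns; -1 if not printable */
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

The result depends on the LC_CTYPE locale, so a real program would
call setlocale(LC_CTYPE, "") once at startup.  POSIX also has
wcswidth(), which does the same summing for a whole string.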

The only problem you have left is the conversion from one charset
to another.  Posix defines iconv(), which allows you to convert
from one charset to another.  This should be available on most OSs
afaik.  There is a (LGPL) library (libiconv) that you can use if
your OS doesn't support it, and afaik, there are even some
alternatives to it.
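
A rough sketch of a one-shot iconv() conversion follows.
convert_charset() is a hypothetical wrapper, and the charset names
used with it ("UTF-8", "ISO-8859-1") are assumptions about what the
local iconv implementation accepts.  It treats a full output buffer
or an unconvertible byte as a plain failure, which a real client
would probably want to handle more gracefully.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: convert inlen bytes of `in` from charset
 * `from` to charset `to`, writing at most outcap bytes to `out`.
 * Returns the number of bytes written, or -1 on any error. */
long convert_charset(const char *to, const char *from,
                     const char *in, size_t inlen,
                     char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);   /* note: target charset comes first */
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;              /* iconv() takes non-const pointers */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;                       /* invalid input or output too small */
    return (long)(outcap - outleft);
}
```

On systems where iconv() lives in libiconv rather than libc, this
needs to be linked with -liconv.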

Anyway, let's try to explain what all needs to happen.

You have the communication with the IRC server.  This should in
ideal circumstances always be in UTF-8.  But it would be nice if
you could fall back to the current codeset (see below) when the
data is not UTF-8.  If you try to decode it as UTF-8 and it works,
it most likely is UTF-8.  If it fails, you fall back.
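
The "try to decode, else fall back" check could look roughly like
this.  is_valid_utf8() is a hypothetical helper, written by hand here
so that it does not depend on the locale; it also rejects overlong
forms and surrogates, which is stricter than a bare decode attempt.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise.  A caller would fall back to the current codeset on 0. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;         /* continuation bytes expected */
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest value allowed for this length */

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation or invalid lead byte */

        if (len - i - 1 < need)
            return 0;        /* sequence is cut off */
        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;    /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF)
            return 0;        /* overlong encoding or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;        /* UTF-16 surrogates are not valid in UTF-8 */
        i += need + 1;
    }
    return 1;
}
```

Pure ASCII passes this check too, which is fine since ASCII is valid
UTF-8.  Short Latin-1 strings can occasionally pass by accident, so
the result is a strong hint rather than a proof.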

Some users might also want the option to not send in UTF-8, but I
don't think we should keep supporting that.  Even mIRC now
supports UTF-8 and should be able to deal with it.

Then you have the communication with the terminal.  This should
happen in the current codeset.  It's determined by the LC_CTYPE
locale variable, and you can get it with nl_langinfo(CODESET).
It's also what the command "locale charmap" returns.
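
For example, a small sketch of asking for the codeset.
terminal_codeset() is a hypothetical name; the setlocale() call is
needed because a C program starts out in the "C" locale regardless of
the environment.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: name of the codeset selected by the user's
 * LC_CTYPE setting, e.g. "UTF-8" or "ISO-8859-1". */
const char *terminal_codeset(void)
{
    /* Pick up LC_ALL / LC_CTYPE / LANG from the environment.  If this
     * fails, the program simply stays in the "C" locale. */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

The returned string can be passed to iconv_open() as the target
charset.  It may be overwritten by later locale calls, so copy it if
you need to keep it.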

There are several ways to deal with it.  One of them is using
iconv() to convert to the terminal codeset, and just dumping that
to the screen.  But then you can't use the w* functions, which you
might want to use.  Instead, you can use something like mbtowc() to
convert from the current codeset to wchar_t, and then use the w*
functions to do your thing; the wide output functions will in turn
do the conversion from wchar_t to the terminal codeset again.
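
A minimal sketch of the first half of that round trip, using
mbstowcs() (the whole-string sibling of mbtowc()).  to_wide() is a
hypothetical helper; it assumes setlocale(LC_CTYPE, "") has already
been called, since the conversion uses the current locale's codeset.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string mb (in the current
 * locale's codeset) into out, which has room for cap wide characters
 * including the terminator.  Returns the number of wide characters
 * stored (not counting the terminator), or -1 on invalid input or if
 * the result does not fit. */
long to_wide(const char *mb, wchar_t *out, size_t cap)
{
    size_t n = mbstowcs(out, mb, cap);

    if (n == (size_t)-1)
        return -1;           /* invalid multibyte sequence */
    if (n >= cap)
        return -1;           /* no room left for the terminating L'\0' */
    return (long)n;
}
```

wcstombs() goes the other way if you want to do the output
conversion yourself instead of using the wide output functions.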

Converting from the codeset to wchar_t and from wchar_t back to
the codeset might look like pointless work, but I think you really
would want to use those w* functions.

Anyway, if you have more questions, feel free to ask.


Kurt
