The previous version (for some reason) was mangled into 
quoted-unprintable, so I am sending it again, this time with smaller 
lines so it won't be completely unreadable.


This is a discussion that is strictly academic at this time.  Please 
do not think that we are close to supporting any of this stuff.  
Rather, I want to start a dialog about how people want this to work, 
so when we do go ahead and design it, people will not think it came 
out of left field.  

So the question on the table is string encodings. 

The input line: Right now, epic doesn't handle encoding on the 
input line -- it just assumes that each byte is one code point.  For 
people using utf-8, one keypress yields one code point, which may 
yield multiple bytes, and those show up as multiple (incorrect) 
characters in the input line rather than the key pressed.  Column 
counting is not broken /as such/.
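
To make the keypress -> code point -> bytes chain concrete, here is a 
minimal Python sketch (not epic code) that feeds terminal bytes to an 
incremental decoder one at a time and only emits a character once a 
whole code point has arrived.  The helper name decode_keypresses is 
made up for illustration.

```python
import codecs

def decode_keypresses(raw, encoding="utf-8"):
    """Hypothetical helper: turn raw input bytes into whole characters."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    chars = []
    for i in range(len(raw)):
        # decode() returns "" until a complete code point has been seen
        out = decoder.decode(raw[i:i + 1])
        if out:
            chars.append(out)
    return chars

raw = "fr\u00f6nd".encode("utf-8")   # what a utf-8 terminal would send
keys = decode_keypresses(raw)
print(len(raw), len(keys))           # 6 5
```

Six bytes come in, but only five keys were pressed; an input line 
that works on bytes sees six "characters".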

The display: Right now, epic doesn't handle encoding on the output 
display.  Any bytes received are just sent to the display, so if 
you output a utf-8 string on a utf-8 emulator, it will show up 
correctly, and if you output a utf-8 string on an iso-8859-* emulator, 
it will yield multiple (incorrect) characters.  Column counting is 
(of course) broken.
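
A small Python sketch of the mismatch, for illustration only: the 
same utf-8 bytes, read the way an iso-8859-1 emulator would read 
them.  Counting code points is used here as a rough stand-in for 
columns; it ignores wide and combining characters.

```python
text = "fr\u00f6nd"                  # 5 code points
wire = text.encode("utf-8")          # 6 bytes on the wire

# An iso-8859-1 emulator shows one character per byte:
shown = wire.decode("iso-8859-1")

print(len(text), len(wire), len(shown))   # 5 6 6
```

A client that counts bytes thinks the string is 6 columns wide, 
which happens to match the iso-8859-1 emulator but is off by one on 
the utf-8 emulator.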

The servers: Globally, the user can /set translation, which converts 
code points from one 8-bit character set (usually ascii) into another 
8-bit character set that the server is using.  This is fine, as long 
as both the user and the server are using 8-bit code points (which is 
not the case for utf-8, obviously).
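
To show why a byte-for-byte translation cannot cover utf-8, here is 
a Python sketch of a 256-entry translation table.  This is not how 
/set translation is implemented; make_table is a made-up helper, and 
cp437 is just an arbitrary second 8-bit character set.

```python
def make_table(src, dst):
    """Hypothetical helper: map each byte in src to one byte in dst."""
    table = bytearray(256)
    for b in range(256):
        try:
            out = bytes([b]).decode(src).encode(dst)
        except UnicodeError:
            out = b"?"
        # Anything that does not fit in exactly one byte becomes "?"
        table[b] = out[0] if len(out) == 1 else ord("?")
    return bytes(table)

table = make_table("iso-8859-1", "cp437")
print(b"fr\xf6nd".translate(table))       # b'fr\x94nd'

# With utf-8 as the target, every non-ascii byte falls back to "?"
print(make_table("iso-8859-1", "utf-8")[0xF6] == ord("?"))   # True
```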

Channel Names: Channel names can be encoded in any encoding.  A 
channel name like #frãnd could be encoded in iso-8859-1 and take 
up 6 bytes, or the channel name could be encoded in utf-8 and take 
up 7 bytes.  The irc server will treat these as separate channels, 
***so it's fundamentally important to be able to specify an encoding 
when specifying a channel name.***
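
The byte counts can be checked with a few lines of Python, taking 
the example name above to contain the single iso-8859-1 byte 0xE3:

```python
name = "#fr\u00e3nd"                 # U+00E3 is byte 0xE3 in iso-8859-1

as_latin1 = name.encode("iso-8859-1")
as_utf8 = name.encode("utf-8")

print(len(as_latin1), len(as_utf8))  # 6 7
print(as_latin1 == as_utf8)          # False
```

The server only ever sees the bytes, so to it these are two 
unrelated names.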

Channel messages: People who chat on the channel may (or may not) use 
any encoding at any time, but usually everyone uses the same encoding, 
which ***may or may not be the same encoding as the channel name 
itself***. 

For example, the channel name may be encoded in iso-8859-1 and the 
users may agree to use utf-8.  ***So it's fundamentally important 
to be able to specify a different encoding for privmsgs on the channel 
than the one used for the channel name itself.***
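
In wire terms that means a single PRIVMSG line can carry two 
encodings.  A Python sketch, with a made-up helper name and a 
made-up message text:

```python
def privmsg(channel, text, name_enc, msg_enc):
    """Hypothetical helper: encode target and body separately."""
    return (b"PRIVMSG " + channel.encode(name_enc)
            + b" :" + text.encode(msg_enc) + b"\r\n")

line = privmsg("#fr\u00f6nd", "hej fr\u00f6nd", "iso-8859-1", "utf-8")
print(line)   # b'PRIVMSG #fr\xf6nd :hej fr\xc3\xb6nd\r\n'
```

The same letter ends up as one byte in the channel name and two 
bytes in the message body.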

THEREFORE,
We're going to have to start thinking about syntax for how to specify
all this stuff on a per-channel, per-server basis.  As a wild example,
we could prefix channel names with an encoding, using characters that
are invalid in channel names.

Example:
        /join (iso-8859-1)#frönd
(join the channel, encoding the channel name in iso-8859-1)

        /join (utf-8)#frönd
(join the channel, encoding the channel name in utf-8)

        /join (iso-8859-1/utf-8)#frönd
(the channel name is encoded in iso-8859-1, but privmsgs will be 
encoded in utf-8)
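
To see whether this prefix idea is even parseable, here is a rough 
Python sketch.  Nothing here is settled syntax: the function name, 
the regular expression, and the iso-8859-1 default for arguments with 
no prefix are all assumptions made for illustration.

```python
import re

# "(name-enc)#chan" or "(name-enc/msg-enc)#chan"
PREFIX = re.compile(
    r"^\((?P<name>[^()/]+)(?:/(?P<msg>[^()/]+))?\)(?P<chan>.+)$")

def parse_target(arg, default="iso-8859-1"):
    """Hypothetical: return (channel, name_encoding, msg_encoding)."""
    m = PREFIX.match(arg)
    if not m:
        return arg, default, default
    name_enc = m.group("name")
    # With no "/msg-enc" part, messages use the name's encoding
    msg_enc = m.group("msg") or name_enc
    return m.group("chan"), name_enc, msg_enc

print(parse_target("(iso-8859-1/utf-8)#fr\u00f6nd"))
print(parse_target("(utf-8)#fr\u00f6nd"))
```

The first call gives the channel with iso-8859-1 for the name and 
utf-8 for messages; the second gives utf-8 for both.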

The last thing I want to do is support utf-8 but then end up having 
it be half-assed and make everyone think I'm a clod for not thinking 
of every last important detail to take care of.  So now is the time 
to tell me what's really important for supporting a multi-encoding 
irc client!

Thanks for your discussion!
Jeremy