Greetings all,

To kick off the EAI discussion, let's start with an nmh architecture discussion. Specifically: how should i18n characters be represented in the format engine?

I decided to start with the format engine because it's used for a lot of things inside of nmh, and deciding what to do with it really makes other decisions clearer. It also gets at the fundamental question of how we want to deal with i18n characters.

To state the problem more explicitly: right now stuff inside the format engine is assumed to be a mix of ASCII and things in the local character set. We don't really have a way to tag stuff in format strings as being in a particular character set. Either we assume the stuff is ASCII, or we magically convert it into the native character set (%(decode) does this, for example, and when we retrieve MIME parameters they get magically converted into the native character set). So, in this mostly unspecified space, things kinda mostly sort of work.

But now with the existence of message/global, things get a bit more complicated. Before, we could assume pretty much everything in the format engine was ASCII, but now if we get 8-bit characters in the format engine they might be in the native character set (output from %(decode)) or in UTF-8.

So what should we do?

One possible option: convert everything to the local character set when text is input to the format engine. This would basically continue existing practice: strings output from the format engine could be directly output to the user without any additional effort, as we do now. This is relatively simple to implement, as we're mostly doing this now. The downside is that if a message comes in with unencoded UTF-8 in the headers (it's clear this is where the world is headed) and the user is NOT using a UTF-8 locale, then you have to convert UTF-8 to something else and potentially lose some characters if the target character set doesn't contain the Unicode character.

Another option is to simply convert everything to UTF-8 as it gets read into the format engine.
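
To make the difference between the two options concrete, here is a rough sketch. It is Python purely for illustration, not nmh code; the function names to_local and to_utf8 are made up for this example, and replacing unrepresentable characters with '?' is just one possible policy.

```python
# Illustration only: not nmh code. Function names are hypothetical.

def to_local(raw: bytes, src_charset: str, local_charset: str) -> str:
    """Option 1: convert input text to the local character set.

    Characters the local charset cannot represent are replaced with '?'
    (an assumed policy; the point is that information is lost).
    """
    text = raw.decode(src_charset)
    return text.encode(local_charset, errors="replace").decode(local_charset)


def to_utf8(raw: bytes, src_charset: str) -> bytes:
    """Option 2: normalize all input text to UTF-8."""
    return raw.decode(src_charset).encode("utf-8")


# Unencoded UTF-8 header text, read by a user in an ISO-8859-1 locale.
header = "Zoë 李".encode("utf-8")
print(to_local(header, "utf-8", "iso-8859-1"))  # prints "Zoë ?"

# ISO-8859-1 input normalized to UTF-8: nothing is lost.
print(to_utf8("Zoë".encode("iso-8859-1"), "iso-8859-1"))  # prints b'Zo\xc3\xab'
```

In the first call the "ë" survives because ISO-8859-1 contains it, while the CJK character is lost; in the second call every input character has a UTF-8 encoding.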
I am assuming at this point that Unicode is a superset of all other character sets; assuming this is true, then no information is lost when converting incoming text no matter the character set.

However, while this SEEMS like it would be easier, it actually complicates the code quite a bit. The format engine would have to change its API; since right now the text is in the native character set, we know how many column positions we've consumed and we can stop when we reach the limit. But if the format engine has UTF-8 internally we wouldn't know when the output has reached the character limit (and we can't process this after the fact, since we wouldn't know which characters don't count against the character limit from things like zputlit). We could change the mh-format API to indicate if the resulting text is supposed to be for display and convert it to the native character set, but then that makes me wonder what the value of the UTF-8 conversion is in the first place. Also, this might result in the possibly undesirable state of someone with an ISO-8859-1 locale sending out headers encoded in UTF-8 (although maybe that's not so bad? I am undecided here). It would require some careful work to get it right.

Thinking more about it ... hell, I don't know which one is right. I'm open to suggestions here. If you have a better idea, please share it!

One final note: Lyndon has suggested that the stdio libraries that are part of Plan 9 might help; I did look at them before, and I do not believe that they will. Specifically, they assume all output is in UTF-8 (because that's how Plan 9 works), but that's not a valid assumption for us.

--Ken

_______________________________________________
Nmh-workers mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/nmh-workers
