This is a summary of what I believe to be the options for handling Unicode newsgroup names in an IETF standard. So far, I see three separate viable options for handling encoded newsgroup names through the entire protocol. Those three options are:

    (A) UTF-8 in articles and NNTP, punycode in e-mail and IMAP
    (B) punycode in e-mail, IMAP, and articles, UTF-8 in NNTP
    (C) punycode everywhere

These are expanded in more detail below. Note that in each case some other encoding system besides punycode could in theory be used. I don't believe the choice of encoding changes the remainder of this analysis, however, so punycode is left as a placeholder (and currently seems to be the most likely choice).

I am also making the assumption that a standard requiring e-mail or IMAP to handle unencoded UTF-8 in message headers and in newsgroup names is not a viable option, due to strong opposition from the e-mail community. Regardless of whether I agree with that opposition or not, I'm uninterested in reopening that discussion, which went on at great length in both usenet-format and ietf-822.

This summary does not address internationalization issues in any other headers or information besides Usenet newsgroup names.

In this analysis, I will be referring to the following components of the Usenet messaging system:

     (1) A newsreader posting via NNTP.
     (2) The NNTP server accepting posts from a client.
     (3) The NNTP transit server relaying posts to other servers.
     (4) The NNTP server providing messages to a client.
     (5) A newsreader reading via NNTP.
     (6) The NNTP server relaying a message posted to a moderated group.
     (7) The local mail system of the NNTP server.
     (8) The mail system of the moderation relay site.
     (9) The local mail system of the moderator.
    (10) The software used by the moderator of a newsgroup.
    (11) A mail to news gateway.
    (12) A news to mail gateway.
    (13) An IMAP server serving Usenet messages to a client.
    (14) An IMAP client reading Usenet messages from an IMAP server.

One or another of these proposed options affects every single component of this system except for (11). In the case of (11), none of these three proposals will affect any existing mail to news gateways.
Existing mail to news gateways may not be able to handle new non-ASCII newsgroups without modification, but all three proposals are backward-compatible in the sense that all currently working gateways to currently existing groups will continue to function as they do now. Please note that I have separated the moderation process into a separate component from a general mail to news gateway. For new mail to news gateways for new non-ASCII newsgroups, the issues are essentially the same as for posting agents (1).

In the case of (14), I am making the assumption that changing the IMAP protocol is not an option. This means that messages served to (14) will not contain unencoded UTF-8 in the headers, and newsgroup names in IMAP will not have unencoded UTF-8 names. All of the work (if any) required to make the articles compatible with an IMAP environment would have to be borne by (13). In all of these cases, it will therefore be desirable for the IMAP client to be modified to understand punycode and display the newsgroup names correctly.

Note that there is an additional component that is left unmentioned above, namely encoding of newsgroup names in URLs. I don't know enough about this area to comment usefully, but I believe that it's somewhat orthogonal to the remaining issues.

(A) UTF-8 in articles and NNTP, punycode in e-mail and IMAP
===========================================================

This is Andrew's original proposal. The canonical name of the newsgroup would be in UTF-8 without further encoding.

The Usenet article format would be defined to carry UTF-8 newsgroup names without further encoding in those headers that contain newsgroup names (Newsgroups, Followup-To, Control, and Xref). Similarly, the body of control messages for non-ASCII newsgroups would be required to be in UTF-8 and would contain the UTF-8 newsgroup names.

NNTP commands would take UTF-8 arguments wherever newsgroup names are referred to.
wildmat would be modified to match UTF-8 characters if the server supported the ? or [] wildcards.

At every point where a Usenet article must be conveyed via e-mail, specifically (6), (12), and (13), any non-ASCII content in Newsgroups and Followup-To would be encoded in punycode (or some other suitable encoding method). (Control and Xref headers are generally not gatewayed.) The envelope recipient used when sending to the moderation relays (8) would contain the encoded form of the newsgroup name.

A moderator (10) who received a post to a non-ASCII newsgroup (either the newsgroup they themselves are moderating or a newsgroup to which the message was crossposted) would, in order to approve the message, have to either decode the newsgroup name to its canonical UTF-8 form again or use an injector (2) that will do this. Otherwise, the article should be rejected.

This proposal requires no changes to (7), (8), or (9); in other words, the existing mail transit systems are unaffected by this proposal. It requires only minimal changes to (2), (3), and (4), the existing news transit system, to remove restrictions preventing creation of non-ASCII newsgroups. It is believed that essentially all existing news transit and server systems still in active use can handle 8-bit newsgroup names without difficulties. It would be desirable for (2), the injection agent, to be able to undo the mail encoding automatically.

Additionally, a news reader (5) may be able to read such groups without modification if it already has support for 8-bit characters and can be configured appropriately, and similarly a news posting agent (1) may also be usable without modification. Updates to (1) and (5) to provide Unicode character entry, canonicalization, and display would of course be extremely desirable. (1) and (5) require no modifications to deal with existing ASCII newsgroups except modifications for 8-bit cleanliness to handle crossposted messages.

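
To make the mail-side encoding step under (A) a bit more concrete, here is a rough Python sketch of what (6), (12), or (13) might do to a Newsgroups or Followup-To value before handing the article to the mail system. It is only an illustration under assumptions of my own: the function names are hypothetical, each dot-separated component is encoded separately, and an "xn--" prefix is borrowed from IDNA to mark encoded components. None of those details are settled by any of the proposals.

```python
# Illustrative sketch only.  ASSUMPTIONS (not part of any proposal):
#   - each dot-separated component of a group name is encoded on its own
#   - encoded components are marked with an "xn--" prefix, as in IDNA
#   - all function names here are hypothetical
ACE_PREFIX = "xn--"


def encode_component(component: str) -> str:
    """Encode one component; pure-ASCII components pass through untouched."""
    if component.isascii():
        return component
    # Python's built-in "punycode" codec implements the bare algorithm.
    return ACE_PREFIX + component.encode("punycode").decode("ascii")


def encode_group(name: str) -> str:
    """Encode a full newsgroup name such as 'de.bücher'."""
    return ".".join(encode_component(c) for c in name.split("."))


def encode_newsgroups_header(value: str) -> str:
    """Encode every group in a comma-separated header value."""
    return ",".join(encode_group(g.strip()) for g in value.split(","))


if __name__ == "__main__":
    print(encode_newsgroups_header("de.bücher, comp.lang.c"))
    # de.xn--bcher-kva,comp.lang.c
```

Leaving ASCII components unencoded keeps existing group names unchanged on the wire, which is what makes the scheme backward-compatible for existing gateways. A real specification would also have to say what happens to an ASCII component that already begins with the chosen prefix.
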
Moderation software (10) would have to change in order to handle any non-ASCII groups, since the mail encoding would have to be decoded, or the moderator would have to arrange to use an updated injecting agent (2). Moderators of existing ASCII newsgroups who didn't want to deal with this issue could simply reject all articles crossposted to non-ASCII newsgroups. There is some likelihood that messages crossposted between moderated ASCII newsgroups and other (moderated or unmoderated) non-ASCII newsgroups would, under some circumstances, be injected into the news system with the non-ASCII newsgroup names still in the mail encoding. The only damage would be that the articles would not show up in the non-ASCII newsgroups to which they were intended to be posted.

Any news to mail gateway (12) would have to be modified if it received any messages crossposted to non-ASCII newsgroups and wanted to preserve the Newsgroups header in the e-mail message. Failure to encode the headers appropriately would result in unencoded 8-bit text in the headers of a mail message, where it may be mangled or rejected by the mail system.

Any IMAP server processing Usenet messages (13) would have to perform the same transformations, encoding newsgroup names in Newsgroups, Followup-To, and Control (and Xref if the IMAP server wished to maintain it). In addition, the newsgroup name would have to be presented to the client in an encoded form; UTF-7 may be preferable in this case to punycode.

(B) punycode in e-mail, IMAP, and articles, UTF-8 in NNTP
=========================================================

This is the intermediate proposal, allowing use of UTF-8 directly in NNTP where it's fairly uncontroversial and continuing to treat the canonical name of the newsgroup as the unencoded UTF-8 form, but always encoding the newsgroup name wherever it occurs in a news article.
This maintains complete RFC 2822 compatibility in the article format, unlike (A), but still allows use of UTF-8 in NNTP.

Any non-ASCII newsgroup names in Newsgroups, Followup-To, Control, and Xref would be encoded using punycode. For ease of processing and consistency, that probably also means that newsgroup names in the bodies of control messages should be encoded in punycode.

All NNTP commands would take UTF-8 arguments for newsgroup names, and the newsgroup names returned by LIST, GROUP, and similar commands would be in UTF-8. This means that the newsgroup name sent to the server in a GROUP command and the newsgroup name in the Newsgroups and Xref headers would not be the same.

While it may still be possible for an extremely sophisticated user to use an unmodified news reader (5) or posting agent (1), it would require the user to override the news client at a multitude of points and would be at best a last-ditch sort of affair, far too clumsy to use for any sustained period. This proposal therefore mandates modifications to (1) and (5) for any user who wants to use non-ASCII newsgroups. If the user only wants to use existing ASCII newsgroups, their existing client software can be used unmodified. It must, however, be able to handle 8-bit newsgroup names returned from the LIST command (but doesn't have to be able to handle 8-bit content in the article headers).

NNTP servers (2) and (4) must be modified in order to carry non-ASCII newsgroups: they have to decode the newsgroup headers when receiving messages so as to know which newsgroups to file them in. The active file would also need to be kept in UTF-8. As above, it is believed that the other NNTP commands besides POST/IHAVE/TAKETHIS would work without modification because existing NNTP software is already 8-bit clean. If the NNTP software is not modified, the newsgroups will show up in their punycode encoded form, possibly confusing compliant news reading software.

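
The decoding step that (2) and (4) would need under (B) can be sketched roughly as follows in Python. As before, this is only an illustration resting on my own assumptions, not something the proposal specifies: the function names are hypothetical, components are encoded individually, and encoded components are assumed to carry an "xn--" prefix.

```python
# Illustrative sketch only.  ASSUMPTIONS (not part of any proposal):
#   - encoded components carry an "xn--" prefix, as in IDNA
#   - each dot-separated component was encoded on its own
#   - all function names here are hypothetical
from typing import List

ACE_PREFIX = "xn--"


def decode_component(component: str) -> str:
    """Decode one component; unprefixed components pass through untouched."""
    if component.startswith(ACE_PREFIX):
        # Malformed input raises UnicodeError; a real server would have to
        # decide whether to reject the article or file it under the raw name.
        return component[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return component


def decode_group(name: str) -> str:
    """Recover the canonical UTF-8 name from its encoded article form."""
    return ".".join(decode_component(c) for c in name.split("."))


def groups_to_file_under(newsgroups_header: str) -> List[str]:
    """Map a received Newsgroups header to the UTF-8 names in the active file."""
    return [decode_group(g.strip())
            for g in newsgroups_header.split(",") if g.strip()]


if __name__ == "__main__":
    print(groups_to_file_under("de.xn--bcher-kva,comp.lang.c"))
    # ['de.bücher', 'comp.lang.c']
```

The decoded names are what a client would send in a GROUP command under (B), while the article itself keeps the encoded form in its headers. That split is the source of the mismatch described above.
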
Transit servers (3) do not need to be modified. For the best support of pattern-based feeds, transit servers will want to decode the newsgroup header as it comes in and then apply wildmat patterns to the decoded form so that wildmat patterns can be specified in UTF-8. The transit servers will continue to function correctly without this modification, however, and news administrators could add additional appropriate patterns to catch the punycode-encoded forms. Presuming that ASCII newsgroup names are not encoded (a reasonable assumption for any encoding format, I believe), the only reason to add punycode support to transit servers would be for the convenience of the administrator in expressing wildmat patterns for non-ASCII newsgroups in an unencoded form. (I believe that the likelihood that a punycode-encoded name would happen to match one of the widely used patterns like *sex* or *mp3* is fairly small, but I could be wrong as I've not done a statistical analysis.)

This proposal requires no modifications to the moderation system of (6), (7), (8), (9), and (10) whatsoever, including while handling non-ASCII groups. It similarly requires no modifications to news to mail gateways (12). IMAP servers (13) may want to recode the newsgroup names from punycode to UTF-7, but would not need to make any transformations to the articles themselves.

(C) punycode everywhere
=======================

The "most encoded" proposal, this proposal says to use punycode everywhere. All newsgroup names in the Usenet articles and via the NNTP protocol would be encoded in punycode and the punycode-encoded version of the newsgroup name would be the canonical one. The name would only be decoded for display purposes in the client software. This maintains complete RFC 2822 compatibility for the article format.

This proposal mandates modifications to the posting agents (1) and the news readers (5) in order to properly display the names.
No modifications need be made to news readers that are reading only ASCII newsgroups; they will just see a bunch of additional oddly-named newsgroups. Existing news readers could still be used to read and post to non-ASCII newsgroups if their users didn't mind the odd names.

No modifications are required to (2) or (4), the NNTP servers, although without modifications the server administrator would have to work with encoded group names. It would provide a much better user interface if the administrative tools implemented punycode encoding and decoding for easier handling of non-ASCII newsgroup names.

The same issues as with (B) apply to transit servers (3), namely that it would be convenient but not required for transit servers to decode the newsgroup names before doing wildmat matching so that the wildmat patterns could be specified in a convenient format.

No modifications are required for the moderation system of (6), (7), (8), (9), and (10) whatsoever, including while handling non-ASCII groups. Similarly, no modifications are required for news to mail gateways (12). IMAP servers (13) may again wish to recode punycode to UTF-7 for newsgroup names, but otherwise require no modification.

Summary
=======

The following chart summarizes the backward compatibility issues for each proposal and each component of the news system. For each portion of the news system, N means no change required, Y means change is required to correctly handle non-ASCII newsgroups, D means change is very desirable but not absolutely necessary, and C means change would be convenient but unmodified software is still fairly usable.

     |  1  2  3  4  5  6  7  8  9 10 11 12 13 14
    -+------------------------------------------
    A|  D  C  N  N  D  Y  N  N  N  Y  N  D  Y  D
    B|  Y  Y  C  Y  Y  N  N  N  N  N  N  N  C  D
    C|  D  C  C  C  D  N  N  N  N  N  N  N  C  D

The above summary I believe correctly indicates that proposal (B) requires the most changes to be made to the news system itself.
It's possible to use existing software without modification with either proposal (A) or proposal (C). Under proposal (A), news readers that aren't 8-bit clean will break, and some news readers may actually get display right without any modifications, but with other news readers it may be impossible to access a non-ASCII group because no Unicode entry is supported. Under proposal (C), we're guaranteed that nothing will break and that it will always be possible to access even non-ASCII groups, but no existing client will display the names correctly. Overall, it is somewhat less necessary to change client software with (A) than with (C); exactly how much less necessary is something of an open question.

Proposal (A) is the only proposal that requires changes to any system outside of the news system (other than changes to an IMAP client to understand the punycode newsgroup names, which are the same for all proposals). Both (B) and (C) work with the moderation, e-mail, and IMAP infrastructure without any additional changes.

-- 
Russ Allbery ([EMAIL PROTECTED])             <http://www.eyrie.org/~eagle/>
