Thanks Eric. This is definitely more comprehensive than I expected! I was only thinking of a simple summary of the whole series of discussions on internationalized domain names and hostnames, so that we know how to capture that properly in Nameprep, ACE, IDNA, DM, or any other relevant document.
Anyway, this is great stuff, perhaps sufficient to be a draft on its own. I am not sure yet how this fits into the IDN WG, but it is something we should discuss further.

-James Seng

----- Original Message -----
From: "Eric A. Hall" <[EMAIL PROTECTED]>
To: "IDN" <[EMAIL PROTECTED]>
Sent: Wednesday, December 05, 2001 4:08 PM
Subject: [idn] naming syntax rules

>
> James has been needling me to put together a second summary of the naming
> rules in time for the IETF meeting. However, I have been extremely busy
> lately (it is 3 am in a strange hotel room) but I wanted to at least
> scratch together enough material for the concepts to be tangible.
>
> The following text is by no means complete. It's hardly just begun.
> However, it illustrates what the scope will be, exposes some of the open
> issues, and may be usable as a touchstone to see if this whole IDN thing
> is going to work or not.
>
> What I mean by that is, if systems are going to work with IDNs in their
> raw form (not encoded form, but raw form) these are the rules they will
> have to work with. If these rules are too complex, the whole approach has
> to be reconsidered. That will affect a lot of other things, including ACE.
>
> The basic idea here is to declare formal data-types for labels, and to
> incorporate the data-types into syntaxes for applications and protocols to
> use when they need to interact with domain names.
>
>
> 1. Summary
>
> This memo describes two sets of definitions which are necessary for the
> consistent and reliable use of internationalized domain names across the
> Internet. First and foremost, this memo specifies the rules which govern
> the structure and syntax of internationalized domain names in various
> scenarios, and also describes their legitimate characters and any
> normalization which may be required.
> Secondarily, this memo also clarifies
> and extends usage rules of common resource records so that
> internationalized domain names can be stored and exchanged (either as
> resource record owner domain names or as resource record data) in a form
> which is consistent across all usage environments.
>
>
> 2. Introduction
>
> There are many issues which affect the characters that are desirable for
> use in DNS domain names. Among these considerations are obvious aspects
> such as breadth, as well as less-obvious aspects such as normalized forms
> of particular character sequences, comparison efficiencies, and more.
>
> The general consensus of the IDN working group is that domain names should
> use a mildly-restricted subset of the character codes and arrangement
> sequences which are documented in the UCS for use with languages, as this
> subset excludes non-verbal symbols and spurious punctuation which are
> likely to be problematic, while still allowing international domain names
> to be created. Furthermore, the consensus is that these character
> sequences should be normalized and converted to lowercase [in that order?]
> wherever this is possible, since this will provide the tightest
> syntactical representation of the supported characters with the least
> amount of ambiguity.
>
> While both of those objectives are highly desirable (and are met in most
> of the scenarios), there are many instances where these objectives are
> incompatible with existing practice.
> For example, existing
> (STD13-compliant) DNS implementations are allowed to use domain names
> which contain any eight-bit character code (0x00 through 0xFF), there
> are some protocol models which specifically require the use of
> punctuation (SRV requires underscore, for example), and some resource
> records can contain domain names that combine both of these elements (SOA
> and RP both provide email addresses as domain name labels, and those can
> contain punctuation or case-specific US-ASCII letters).
>
> In order to accommodate these divergent requirements, this memo describes
> multiple types of domain name labels, including their valid characters,
> any case-conversions and/or normalizations which may be required, and so
> forth.
>
> Furthermore, in order to ensure that these rules are consistently
> implemented (and to minimize damage when they are not), this memo also
> states which label data-types are valid for use with many of the common
> resource records.
>
> Cumulatively, this means that a system which attempts to use an
> internationalized domain name for a specific purpose will have to be aware
> of the rules which govern the resource record which provides that service,
> and will have to be aware of the rules which govern the domain name
> data-types which are valid for that resource record. For example, if an
> application knows that an internationalized domain name will be used for a
> forward lookup, it will have to be aware of the label data-types that are
> usable with A (or AAAA) resource records, and must ensure that the domain
> name is processed (normalized and lower-cased, in this example) before it
> is used.
>
> NOTE: Legacy systems which use a backwards-compatible encoding scheme for
> access to resources with internationalized domain names will not be
> required to perform any of these tests.
> However, systems which embrace
> internationalized domain names as specific data (EG, any system which
> encodes or decodes an internationalized domain name as explicit data) will
> need to be aware of these issues and will likely be required to enforce
> their usage.
>
>
> 3. Domain Names and Label Data-Types
>
> An internationalized domain name is a sequence of labels which are
> encapsulated in a message. The message may provide the labels as separate
> units of data (as is the case with DNS), or may provide them as a series
> of dot-separated textual strings (as is the case when domain names are
> "written-out" in protocol or application data streams).
>
> In global terms, an internationalized domain name has the following
> characteristics:
>
> * Series of labels (1*label)
>
> * Maximum cumulative length of 255 UCS character codes (not necessarily
> codes with matching characters, and most definitely not octets or any
> encoded representation). This limit includes any separators which may be
> provided (such as the full-stop character commonly used as a separator
> when the domain name is written), and also includes one character for the
> root domain (the trailing dot).
>
> The labels that make up a domain name will vary according to the
> contextual use of the domain name.
>
>
> 3.1. Opaque Labels
>
> Some functions can use domain names which consist of unstructured or
> unknown labels. For example, a TXT resource record can describe anything,
> and as such, it can use any sequence of UCS characters for its owner
> domain name.
>
> Opaque labels require no processing on the part of the application which
> is using the domain name. It is the responsibility of the user to provide
> the domain name to the application in its correct case and/or
> normalization form.
>
> Opaque labels have the following characteristics:
>
> * Any valid UCS character code (not necessarily a valid UCS character).
>
> * Minimum length of one UCS character code.
>
> * Maximum length of 63 UCS character codes.
>
> NOTE: Even though a domain name may sometimes consist of a variable number
> of opaque labels, most domain names will also contain at least some host
> labels. In those cases, the entire domain name should be provided as a
> series of opaque labels, and the host labels should be determined
> beforehand. For example, a CNAME resource record can reference anything,
> including an A RR that consists entirely of host labels, or a TXT RR that
> consists of a mixture of opaque and host labels. As such, it will depend
> on the formats in use by the alias target, and will inherit those
> attributes.
>
>
> 3.2. Host Identifier Labels
>
> Most functions will use domain names to identify a host, either directly
> or indirectly. For example, a host may be identified by a relative domain
> name which consists of only a local label, or by an FQDN which contains a
> series of host labels. Since all forms must be supportable, all namespace
> delegation functions also use the host label syntax.
>
> The UCS characters provided in host labels are required to be converted to
> lowercase and normalized according to the rules in [nameprep] before they
> are processed. Servers are likely to treat such labels as exact matches of
> the encoded data, so it is imperative that applications perform this work
> before they encode the label into a DNS query.
>
> Host labels are used for any lookups, protocol actions, or message formats
> which specifically make use of internationalized domain names for host
> identification purposes.
>
> Host labels have the following characteristics:
>
> * UCS characters from the following ranges:
>
>     "letters" [need a property]
>
>     characters with number property [?]
>
>     characters with diacritical mark property [?]
>
>     hyphen-minus (U+002D)
>
> * MUST be converted to lowercase according to [nameprep].
>
> * MUST be normalized according to [nameprep].
>
> * First and last characters in the label MUST NOT be a diacritical mark or
> hyphen-minus.
>
> * Minimum length of two characters.
>
> * Maximum length of 63 characters.
>
>
> 3.3. ASCII Labels
>
> Some functions require labels that contain extended punctuation, but which
> also require case-neutral comparisons. The most readily apparent of these
> usages is the SRV resource record, which makes use of the underscore
> character (U+005F) and case-neutral US-ASCII in the owner labels.
>
> ASCII labels have the following characteristics:
>
> * Any printable character from US-ASCII (0x21 through 0x7E, inclusive).
>
> * SHOULD be converted to lowercase as specified in [nameprep] (note that
> servers are required to perform case-neutral comparisons, but certain
> tools will likely prefer to generate and use lower-case wherever possible,
> so lowercase is the preferred form). All comparison operations on these
> domain names MUST be performed in a case-neutral form.
>
> * Minimum length of one character.
>
> * Maximum length of 63 characters.
>
> NOTE: some resource records may define tighter restrictions.
>
> NOTE: Even though a domain name may sometimes consist of a variable number
> of ASCII labels, most domain names will also contain at least some host
> labels. In those cases, the entire domain name should be provided as a
> series of opaque labels, and the ASCII and host labels should be
> determined beforehand.
>
>
> 3.4. Mailbox Labels
>
> Some functions provide SMTP mailboxes as labels within domain names. For
> example, the SOA and RP resource records both provide email addresses,
> with the first label providing a mailbox (local-part) of the address, and
> with the remainder of the labels providing the delivery domain of the
> address.
>
> In order for these resources to be accessible, applications must process
> labels which are known to contain email addresses through these rules.
> This means that data must be provided in a non-normalized, non-lowercased
> form, and must be restricted to the range of characters which are valid,
> as specified in section XX of RFC 2822. Until RFC 2822 is deprecated or
> until such a time as UCS characters can be stored in the mailbox portion
> of Internet standard email addresses, the mailbox label is to be processed
> according to the rules set forth in RFC 2822.
>
> There are two additional rules which govern this data-type:
>
> * Minimum length of one character.
>
> * Maximum length of 63 characters.
>
> NOTE: mailbox labels can contain a large number of special characters such
> as spaces or full-stop. These characters may require escaping as described
> in section XX of this document.
>
> NOTE: Mailbox labels are NOT a subset of the ASCII labels. Mailbox labels
> are case-sensitive, while ASCII labels are case-neutral.
>
>
> 4. Resource Records
>
> The following structure is used to describe resource records and their
> usage of internationalized domain names and labels.
>
>     <owner domain name labels> <mnemonic> <[data] [data] [...]>
>
> A, always provides a host identifier
>
>     <1*host> <A> <[IPv4 address]>
>
>
> AAAA, always provides a host identifier
>
>     <1*host> <AAAA> <[IPv6 address]>
>
>
> CNAME, can reference anything, can target anything
>
>     <1*opaque> <CNAME> <[1*opaque]>
>
>
> NS, references a host, provides a host identifier
>
>     <1*host> <NS> <[1*host]>
>
>
> SOA, references a host (delegation), provides host identifier, email
> address, and custom data
>
>     <1*host> <SOA> <[1*host] [1mailbox (*host)] [serial] [refresh] [retry]
>     [expire] [ttl]>
>
>
> WKS, always provides a host identifier
>
>     <1*host> <WKS> <[XX] [XX]>
>
>
> PTR, can reference anything, must inherit target attributes
>
>     <1*opaque> <PTR> <[1*opaque]>
>
>
> HINFO, references a host, provides RR-specific data
>
>     <1*host> <HINFO> <[hardware] [opsys]>
>
>
> MX, references a host, provides a preference and a host identifier
>
>     <1*host> <MX> <[preference] [1*host]>
>
>
> TXT, can reference anything, provides free-text data
>
>     <1*host> <TXT> <[text]>
>
>
> RP, can reference anything, provides email address and a pointer to a TXT
> RR
>
>     <1*opaque> <RP> <[1mailbox (*host)]> <1*opaque>
>
>
> SRV, references a protocol (which is specified using the ASCII data-type),
> provides preference values and a host identifier
>
>     <1*ASCII> <SRV> <[priority] [weight] [port] [1*host]>
>
> [NOTE: cannot define <2ASCII *HOST> because not all SRV protocol labels
> are just _service._transport]
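
To make the label data-types above a bit more tangible, here is a rough Python sketch of how an application might check an owner name against the data-type that a resource record expects. It is only an illustration under several assumptions that are not in the memo: NFKC plus str.lower() stands in for [nameprep], Unicode general categories L/N/M stand in for the "letters", number and diacritical mark properties that the memo leaves open, mailbox labels are left out, and all function and table names are made up for the example. The owner table covers only a few of the records listed in section 4.

```python
import unicodedata

MAX_LABEL_LEN = 63    # per-label limit, counted in UCS character codes
MAX_NAME_LEN = 255    # whole-name limit, including separators and the root dot


def prepare_host_label(label: str) -> str:
    # ASSUMPTION: NFKC + lower() is only a stand-in for the real [nameprep].
    return unicodedata.normalize("NFKC", label).lower()


def is_opaque_label(label: str) -> bool:
    # Opaque labels: any character codes, length 1..63.
    return 1 <= len(label) <= MAX_LABEL_LEN


def is_ascii_label(label: str) -> bool:
    # ASCII labels: printable US-ASCII (0x21..0x7E), length 1..63.
    printable = all(0x21 <= ord(ch) <= 0x7E for ch in label)
    return printable and 1 <= len(label) <= MAX_LABEL_LEN


def _is_mark(ch: str) -> bool:
    return unicodedata.category(ch).startswith("M")


def _host_char_ok(ch: str) -> bool:
    # ASSUMPTION: general categories L*, N*, M* approximate the open properties.
    return ch == "-" or unicodedata.category(ch)[0] in "LNM"


def is_host_label(label: str) -> bool:
    # Host labels: length 2..63, already prepared, restricted repertoire,
    # and no hyphen-minus or mark in first or last position.
    if not 2 <= len(label) <= MAX_LABEL_LEN:
        return False
    if label != prepare_host_label(label):
        return False
    if not all(_host_char_ok(ch) for ch in label):
        return False
    return not any(ch == "-" or _is_mark(ch) for ch in (label[0], label[-1]))


LABEL_CHECKS = {
    "opaque": is_opaque_label,
    "host": is_host_label,
    "ascii": is_ascii_label,
}

# Owner data-types for a subset of the records in section 4.
OWNER_TYPE = {
    "A": "host", "AAAA": "host", "NS": "host", "MX": "host",
    "CNAME": "opaque", "PTR": "opaque", "SRV": "ascii",
}


def owner_name_ok(rr_type: str, name: str) -> bool:
    # Count one character for the root dot even if it was not written.
    total = len(name) if name.endswith(".") else len(name) + 1
    if total > MAX_NAME_LEN:
        return False
    labels = name[:-1].split(".") if name.endswith(".") else name.split(".")
    check = LABEL_CHECKS[OWNER_TYPE[rr_type]]
    return all(check(label) for label in labels)
```

With this sketch, owner_name_ok("A", "www.example.com") passes, while owner_name_ok("A", "_ldap.example.com") fails because underscore is outside the host repertoire; the same name is acceptable as an SRV owner, since SRV owners are checked as ASCII labels.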
