After taking a close look at IDNA and stringprep, I see many problems in the handling of characters in domain names.
IDNA says that more than one kind of dot (full stop) must be recognized as a label separator in domain names. While this might be a natural thing to do in some cases, it is much cleaner if only U+002E is allowed. It simplifies parsing a lot and is closer to what programs are used to. Also, there are many more dots in UCS that could just as well be used.

In general I can see three basic contexts for domain names: 1) free text, 2) standard form in protocols, 3) comparison form.

In 1), free text, you could write the name any way you like.

In 2), the form must be normalised. Here only one "dot" should be allowed, so that exactly one well-defined character separates the labels of a domain name. Stringprep/nameprep/IDNA defines a normalisation that does not fit here. Its NFKC, lower casing and character mapping destroy too much of the original name. NFKC has several kinds of mappings; those related to letters are mostly fine, but those related to other types of characters, like symbols or accents, are doubtful. If symbols and accents are to be allowed in domain names, only NFC should be applied to them. NFKC is irregular in its handling of accents: some are expanded from the compact form and some are retained in the compact form. I would recommend using NFC, probably with some of the elementary compatibility mappings for letters added (like U+212B to U+00E5 and all fullwidth letters to standard width).

In 3), I think Stringprep/nameprep/IDNA goes too far. Domain names should be compared case-insensitively using simple case folding instead of full folding plus some Turkish folding, as IDNA does today. For example, small letter sharp s should not be folded into ss. This would simplify domain name matching a lot and make it quicker and easier to implement. (We do need a way to do approximate matching in programs interacting with the user. In that case small letter sharp s could be matched to ss, and other complex matching rules could be used. But this should not be the normal way for DNS.)
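
To make the differences concrete, here is a minimal Python sketch using only the standard library. It is an illustration of the points above, not an implementation of nameprep or of any proposal. The function names are hypothetical. Python has no built-in simple case folding, so str.lower() is used as a rough stand-in (an assumption that holds for the sharp s example but not for every character), while str.casefold() performs full case folding.

```python
import unicodedata

# The four label separators IDNA (RFC 3490) asks implementations to recognize:
# full stop, ideographic full stop, fullwidth full stop, halfwidth ideographic full stop.
IDNA_DOTS = "\u002e\u3002\uff0e\uff61"


def split_labels_strict(name):
    """Hypothetical helper: split on U+002E only (the protocol form argued for above)."""
    return name.split("\u002e")


def split_labels_idna(name):
    """Hypothetical helper: treat all four IDNA dots as label separators."""
    for dot in IDNA_DOTS[1:]:
        name = name.replace(dot, "\u002e")
    return name.split("\u002e")


def nfc(text):
    return unicodedata.normalize("NFC", text)


def nfkc(text):
    return unicodedata.normalize("NFKC", text)


def fold_simple(text):
    """Approximation of simple case folding: NFC, then str.lower() (assumption)."""
    return nfc(text).lower()


def fold_full(text):
    """Full case folding via str.casefold(), which maps sharp s to 'ss'."""
    return nfc(text).casefold()


if __name__ == "__main__":
    name = "www\u3002example.com"
    print(split_labels_strict(name))   # ideographic dot stays inside the first label
    print(split_labels_idna(name))     # ideographic dot separates labels

    # NFKC rewrites symbols and some spacing accents; NFC leaves them alone.
    for ch in ("\u2122", "\u00b4", "\u0060", "\uff21", "\u212b"):
        print(f"U+{ord(ch):04X}", repr(nfc(ch)), repr(nfkc(ch)))

    print(fold_simple("Stra\u00dfe"), fold_full("Stra\u00dfe"))
```

In this sketch U+00B4 (acute accent) is expanded by NFKC to a space plus a combining acute, while U+0060 (grave accent) is left untouched, which is one instance of the irregularity mentioned above. Note also that NFC alone already maps U+212B to U+00C5; reaching U+00E5 requires the lower casing step as well.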
Is it good that IDNA goes too far and makes people think this is the way to go? It will be difficult to fix later on.

Dan
