----- Original Message -----
From: "Adam M. Costello" <[EMAIL PROTECTED]>

> The Unicode character database classifies each character as belonging to
> exactly one of the following broad classes:
>
> L: letter
> M: mark
> N: number
> P: punctuation
> S: symbol
> Z: separator
> C: other
May I add this?

U: unassigned code points.

> We can start by examining which of these classes of ASCII characters are
> allowed in ASCII host labels.
>
> L: 52 exist, all are allowed
> M: 0 exist
> N: 10 exist, all are allowed
> P: 23 exist, only hyphen-minus is allowed
> S: 9 exist, none are allowed
> Z: 1 exists, it is not allowed
> C: 33 exist, none are allowed

U: indefinite, all are allowed.

> We can trivially extend these results to form a simple rule covering the
> entire Unicode repertoire, except that we have no precedent for class
> M. Since characters in class M tend to be things like diacritics, they
> should be allowed. So the proposed rule is:
>
> All characters in classes L (letter), M (mark), and N (number) are
> allowed, and U+002D (hyphen-minus) is also allowed. Everything else is
> forbidden.

U should also be allowed, in addition to L, M, and N. But in later
versions of Unicode, U may be partitioned into L' through C' and a
smaller U'.

Soobok Lee
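
For anyone who wants to check the ASCII counts and try the rule, here is a rough Python sketch. It is only an illustration, not anything from the proposal itself. The function names and the allow_unassigned flag are made up for this example. It also assumes that the "U" class corresponds to what Python's unicodedata module reports as general category Cn (unassigned), which the Unicode database files under C rather than as a separate top-level class.

```python
import unicodedata
from collections import Counter


def ascii_class_counts():
    """Count ASCII code points (0-127) by broad class, i.e. the first
    letter of each code point's Unicode general category."""
    return Counter(unicodedata.category(chr(cp))[0] for cp in range(128))


def is_allowed(ch, allow_unassigned=False):
    """Hypothetical check of the proposed rule for a single character.

    Allowed: classes L, M, N, plus U+002D (hyphen-minus).
    allow_unassigned=True also admits unassigned code points, which is
    the extension suggested in the reply (assumption: "U" means Cn).
    """
    if ch == "\u002d":  # hyphen-minus is the only punctuation allowed
        return True
    category = unicodedata.category(ch)
    if category == "Cn":  # unassigned in this Python's Unicode tables
        return allow_unassigned
    return category[0] in "LMN"


if __name__ == "__main__":
    print(dict(ascii_class_counts()))
    print(is_allowed("a"), is_allowed("_"), is_allowed("\u0301"))
```

The ASCII tallies come out as L=52, N=10, P=23, S=9, Z=1, C=33, matching the figures quoted above. Outside ASCII the answers depend on which Unicode version the Python build ships with, which is exactly the partitioning issue: a code point that is Cn today may land in L or S in a later version.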
