L.M.Tseng asked:

> Dear All:
> In the draft ,
> http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-06.txt
> define the single Unicode code point as follow:
>
> Unicode [UNICODE] is a coded character set containing tens of thousands
> of characters. A single Unicode code point is denoted by "U+" followed
> by four to six hexadecimal digits, while a range of Unicode code points
> is denoted by two hexadecimal numbers separated by "..", with no
> prefixes.

Patrick, Paul, or Adam may offer further clarification, but this is
basically a Unicode nomenclatural issue. The string "U+006A" is a
denotation for the Unicode code point (in the overall range of possible
values 0..10FFFF), as well as the character encoded at that code point,
namely LATIN SMALL LETTER J. The case doesn't matter, although the
Unicode Standard most often uses uppercase. So some people would also
use "U+006a" or "u+006a" for the same Unicode code point.

>
> My question are:
> Q1: U+hhhh can be represented as u+hhhh or not ?

Yes. And you can also just leave off the U+ altogether where it is clear
you are referring to Unicode characters, i.e. "hhhh", so for the LATIN
SMALL LETTER J, just "006A" or "006a".

> Q2: Here U+HHHH is not a hostname , does it MUST be forced to lower
> u+hhhh or not in nameprep ?

I think you are mixing things up. If you put a Unicode character into a
hostname, you don't literally put the string "U+006A" (or whatever) into
the hostname, you put the Unicode encoded representation, in whatever
form of Unicode you are using, into the hostname.

Thus, if my hostname was "jam", in Unicode UTF-8, that would be just
0x6A 0x61 0x6D, since the Unicode values for ASCII characters like "j"
are the same as ASCII in UTF-8.

If my hostname was the Chinese word for 'banana', just to pick a random
example, that consists of two characters (pinyin: xiang1jiao1). The
Unicode values for those characters are U+9999 U+8549. If you have a
Unicode string, that would just be two 16-bit numbers, 0x9999 followed
by 0x8549, if using Unicode UTF-16, or the following byte sequence if
using Unicode UTF-8: 0xE9 0xA6 0x99 0xE8 0x95 0x89.

> Q3: Puny code draft accept U+hhhh or u+hhhh to let the final encoded
> ASCII character (last character of corresponding encoded code point) with
> case upper or lower.

If I am interpreting things correctly, Punycode is defined on the
Unicode code points, and certainly not on the short identifier strings
for the Unicode code points. So for the Chinese 'banana' example, you'd
be encoding two code point integers (39321 = 0x9999, followed by
34121 = 0x8549), *not* the string of integers corresponding to the ASCII
string "U+9999U+8549".

Of course, somebody might want to try having a hostname or domain name
of "U+9999U+8549", but that is a 12 character ASCII string, and is not
the same thing at all as the two-character Unicode string for the
Chinese word for 'banana'.

--Ken

>
> I hope draft authors can help to clarify these interconnection point .
>
> L.M.Tseng
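
For anyone who wants to check the values quoted above, here is a minimal Python sketch. It is only an illustration under some assumptions: the helper name parse_code_point and its regular expression are hypothetical and not taken from any of the drafts, and the "punycode" codec is the one that ships with Python, used here only to show that the encoder works on code points rather than on the "U+hhhh" identifier text.

```python
import re

# Hypothetical helper (an assumption for illustration, not from any draft):
# accepts "U+006A", "u+006a", or the bare hex digits "006a".
_CODE_POINT = re.compile(r"(?:[Uu]\+)?([0-9A-Fa-f]{4,6})")


def parse_code_point(text: str) -> int:
    """Return the integer code point named by a U+hhhh style identifier."""
    match = _CODE_POINT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a code point identifier: {text!r}")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError(f"outside the Unicode range: {text!r}")
    return value


# The two-character Chinese word for 'banana', built from its code points.
# Mixed case in the identifiers is deliberate: it makes no difference.
banana = "".join(chr(parse_code_point(s)) for s in ("U+9999", "u+8549"))

# The 12-character ASCII string that merely spells out the identifiers.
label = "U+9999U+8549"

utf8 = banana.encode("utf-8")       # byte sequence for the UTF-8 form
utf16 = banana.encode("utf-16-be")  # two 16-bit units, big-endian

print([ord(c) for c in banana])     # the code point integers
print(utf8.hex(" "))
print(utf16.hex(" "))
print(len(banana), len(label))
# Python's built-in "punycode" codec encodes the code points themselves,
# so the two strings produce different results.
print(banana.encode("punycode"), label.encode("punycode"))
```

The case of the letters in "U+9999" or "u+8549" only affects how a person writes the identifier down; by the time anything is encoded, only the integers remain.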
