"tsenglm@????????????.??????.tw" <[EMAIL PROTECTED]> wrote:
> Unicode [UNICODE] is a coded character set containing tens of thousands
> of characters. A single Unicode code point is denoted by "U+" followed
> by four to six hexadecimal digits...
>
> My question are:
> Q1: U+hhhh can be represented as u+hhhh or not ?

The Unicode standard always uses U+, never u+, and the same is true of
the IDNA draft. The Punycode draft always uses U+ in the main spec, but
the sample implementation uses both U+ and u+ in order to represent the
annotation flags, and the examples section likewise uses both U+ and u+
to make it easy to feed the examples into the sample implementation.

> Q2: Here U+HHHH is not a hostname , does it MUST be forced to lower
> u+hhhh or not in nameprep ?

The case of the U is not part of the code point. A code point is just
an integer. For example, U+0391 and u+0391 both represent the integer
913 (decimal), which is the code point for uppercase alpha. U+03B1 and
u+03B1 both represent the integer 945 (decimal), which is the code point
for lowercase alpha.

Nameprep always converts uppercase alpha to lowercase alpha (so it would
always output 945, never 913), but a nameprep implementation that
included support for mixed-case annotations would output not only an
array of code points but also a parallel array of case flags, and the
lowercase alpha (945) would be flagged as "wanting to be uppercase".
The flags could be passed along to the Punycode encoder and recovered by
the Punycode decoder.

The Punycode sample implementation and examples section use U+03B1 to
mean "lowercase alpha with flag set (wants to be uppercase)" and use
u+03B1 to mean "lowercase alpha with flag clear (wants to stay
lowercase)". The flags have no effect on which ASCII letters and digits
are output by the Punycode encoder. The flags merely affect the
upper/lowercase property of the ASCII letters.

AMC
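The U+/u+ annotation convention and the parallel flag array described above can be illustrated with a short Python sketch. This is not the Punycode sample implementation: the function names parse_annotated and fold_with_flags are hypothetical, and Python's str.lower() is only a stand-in for nameprep's case mapping, which it does not fully reproduce.

```python
import string


def parse_annotated(token):
    """Split 'U+03B1' or 'u+03B1' into (code_point, wants_upper).

    The case of the leading U carries only the annotation flag;
    the code point itself is the integer value of the hex digits.
    """
    prefix, digits = token[:2], token[2:]
    if prefix not in ("U+", "u+"):
        raise ValueError(f"missing U+ or u+ prefix: {token!r}")
    if not 4 <= len(digits) <= 6 or any(d not in string.hexdigits for d in digits):
        raise ValueError(f"expected four to six hex digits: {token!r}")
    return int(digits, 16), prefix == "U+"


def fold_with_flags(code_points):
    """Lowercase each code point and record which ones were changed.

    Returns two parallel lists: the folded code points, and a flag per
    position that is True when the original "wants to be uppercase".
    Assumption: str.lower() approximates nameprep's case mapping.
    """
    folded, flags = [], []
    for cp in code_points:
        ch = chr(cp)
        lower = ch.lower()
        if len(lower) != 1:
            # Keep this sketch one-to-one; skip multi-character mappings.
            lower = ch
        folded.append(ord(lower))
        flags.append(ch != lower)
    return folded, flags
```

With these definitions, parse_annotated("U+0391") and parse_annotated("u+0391") both yield the code point 913 and differ only in the flag, while fold_with_flags([0x0391, 0x03B1]) yields the code points [945, 945] with flags [True, False].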
