On Nov 24, 2007, at 6:39 AM, Ben Hood wrote:

> Hi,
>
> I have a question about the length encoding for UTF-8.
>
> The string "\u00c3" is represented as xc3 x83 in UTF-8.
>
> According to the spec, this should be encoded as x01 xc3 x83.
>
> So it would seem that the length refers to the length of the native
> encoding.

It's the length in 16-bit Unicode (with surrogate pairs).

> But wouldn't it be more practical for a parser to know the length of
> the UTF-8 payload, i.e. x02 xc3 x83?

Not if the native Unicode string is based on 16-bit characters. With
Hessian's encoding, the parser can preallocate the character buffer
before reading the string. So the current length is a useful value.
The number of bytes isn't actually important at all, since it doesn't
correspond to any data structure on either end. You can't do anything
useful with that value. (Unless you're using C's 8-bit encoding, but
no choice is going to be completely efficient for every language.)

Also, calculating the number of bytes would require an extra pass
through the string.
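For illustration, here's a rough sketch of that kind of reader in
Java. It's not the actual Hessian parser: readShortString is a made-up
name, it only handles the single-byte length form from your x01 xc3
x83 example, and it assumes each 16-bit character (surrogate halves
included) arrives as its own 1-3 byte UTF-8 sequence.

    import java.io.IOException;
    import java.io.InputStream;

    // Sketch for illustration only; not the actual Hessian parser.
    public class ShortStringReader {
        // Reads a short length-prefixed string such as x01 xc3 x83.
        // The single length byte is the number of 16-bit characters, so
        // the char buffer can be allocated before any payload bytes are
        // decoded.
        static String readShortString(InputStream in) throws IOException {
            int length = in.read();          // character count, not byte count
            char[] buf = new char[length];   // preallocate from the prefix

            for (int i = 0; i < length; i++) {
                int ch = in.read();

                if (ch < 0x80) {
                    // 1-byte sequence: plain ASCII
                    buf[i] = (char) ch;
                }
                else if ((ch & 0xe0) == 0xc0) {
                    // 2-byte sequence, e.g. xc3 x83 -> '\u00c3'
                    buf[i] = (char) (((ch & 0x1f) << 6) | (in.read() & 0x3f));
                }
                else if ((ch & 0xf0) == 0xe0) {
                    // 3-byte sequence covers the rest of the 16-bit range,
                    // assuming surrogate halves are written individually
                    int b1 = in.read();
                    int b2 = in.read();
                    buf[i] = (char) (((ch & 0x0f) << 12)
                        | ((b1 & 0x3f) << 6) | (b2 & 0x3f));
                }
                else {
                    throw new IOException("malformed UTF-8 in string");
                }
            }

            return new String(buf);
        }
    }

On the bytes x01 xc3 x83 from your example this returns a
one-character string, with the buffer sized correctly up front and no
extra pass over the data.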
> Wouldn't that be more consistent with UTF-8 strings whose characters
> are all 1 byte in UTF-8, e.g. x05 hello?

I'm not sure what you mean. The current definition is completely
consistent. ASCII happens to have the same number of bytes as
characters in UTF-8, but that's just a coincidence.
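For example, "hello" would go on the wire as x05 x68 x65 x6c x6c x6f
(five characters, five payload bytes), while "\u00c3" goes as x01 xc3
x83 (one character, two payload bytes). The prefix is the character
count in both cases; it only happens to equal the byte count for the
ASCII string.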
-- Scott

> Thx,
>
> Ben

_______________________________________________
hessian-interest mailing list
[email protected]
http://maillist.caucho.com/mailman/listinfo/hessian-interest