On Nov 24, 2007, at 6:39 AM, Ben Hood wrote:

> Hi,
>
> I have a question about the length encoding for UTF-8.
>
> The string "\u00c3" is represented as xc3 x83 in UTF-8.
>
> According to the spec, this should be encoded as x01 xc3 x83.
>
> So it would seem that the length refers to the length in the native
> encoding, i.e. the number of characters rather than bytes.

It's the length in 16-bit Unicode (UTF-16) code units, where a
surrogate pair counts as two.
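
For example, in Java (where String.length() counts 16-bit UTF-16 code
units) the relationship looks like this; just a quick illustration,
not the actual Hessian writer code:

    import java.nio.charset.StandardCharsets;

    public class LengthDemo {
        public static void main(String[] args) {
            String s = "\u00c3";                 // the character U+00C3
            System.out.println(s.length());      // 1 -> the x01 in x01 xc3 x83
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 2 UTF-8 bytes

            // A supplementary character is a surrogate pair:
            // 2 code units, but 4 UTF-8 bytes.
            String smiley = new String(Character.toChars(0x1F600));
            System.out.println(smiley.length());                            // 2
            System.out.println(smiley.getBytes(StandardCharsets.UTF_8).length);  // 4
        }
    }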

> But wouldn't it be more practical for a parser to know the length of
> the UTF-8 payload, i.e. x02 xc3 x83?

Not if the native Unicode string is based on 16-bit characters.  With
Hessian's encoding, the parser can preallocate the character buffer
before reading the string, so the current length is a useful value.
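
To illustrate the idea (only a sketch, not Caucho's actual reader; the
names are made up, and surrogate pairs and 4-byte UTF-8 sequences are
ignored for brevity), a parser that is handed the character count can
size its char[] exactly and decode straight into it:

    import java.io.IOException;
    import java.io.InputStream;

    class StringReaderSketch {
        // Reads charLength UTF-8 encoded characters into an exactly-sized buffer.
        static String readString(InputStream in, int charLength) throws IOException {
            char[] buf = new char[charLength];   // allocated once, no resizing
            for (int i = 0; i < charLength; i++) {
                int b = in.read();
                if (b < 0x80) {                  // 1-byte sequence (ASCII)
                    buf[i] = (char) b;
                } else if ((b & 0xe0) == 0xc0) { // 2-byte sequence
                    buf[i] = (char) (((b & 0x1f) << 6) | (in.read() & 0x3f));
                } else {                         // 3-byte sequence
                    buf[i] = (char) (((b & 0x0f) << 12)
                            | ((in.read() & 0x3f) << 6)
                            | (in.read() & 0x3f));
                }
            }
            return new String(buf);
        }
    }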

The number of bytes isn't actually important, since it doesn't
correspond to any data structure on either end, so you can't do
anything useful with that value.  (Unless you're working with C-style
8-bit strings, but no single convention is going to be efficient for
every language.)

Also, calculating the number of bytes would require an extra pass  
through the string.
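
For comparison, a writer (again only a sketch with made-up names,
assuming the one-byte length form from your example and ignoring
surrogate pairs and longer strings) can emit everything in a single
pass, because the character count comes for free from the string
object.  A byte-count prefix would force either a pre-scan of the
string or buffering of the encoded output:

    import java.io.IOException;
    import java.io.OutputStream;

    class StringWriterSketch {
        // Writes <char count> followed by the UTF-8 bytes, in one pass.
        static void writeString(OutputStream out, String s) throws IOException {
            out.write(s.length());               // length in 16-bit chars, already known
            for (int i = 0; i < s.length(); i++) {
                char ch = s.charAt(i);
                if (ch < 0x80) {                 // 1 byte
                    out.write(ch);
                } else if (ch < 0x800) {         // 2 bytes
                    out.write(0xc0 | (ch >> 6));
                    out.write(0x80 | (ch & 0x3f));
                } else {                         // 3 bytes
                    out.write(0xe0 | (ch >> 12));
                    out.write(0x80 | ((ch >> 6) & 0x3f));
                    out.write(0x80 | (ch & 0x3f));
                }
            }
        }
    }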

>
> Wouldn't that be more consistent with UTF-8 strings whose characters
> are all 1 byte in UTF-8, e.g. x05 hello?

I'm not sure what you mean.  The current definition is completely
consistent.  ASCII text happens to have the same number of bytes as
characters in UTF-8, but that's just a coincidence.

-- Scott

>
> Thx,
>
> Ben


