On Sep 6, 2011, at 10:58 AM, Tobias Oberstein wrote:
>> In contrast, *not* requiring breaking at UTF-8 code points means that clients
>> can't do any meaningful validation on text frames. Which means you might
>> as well get rid of text frames entirely.
>
> Why?
>
> You can do streaming validation of UTF-8 without requiring frame boundaries to
> observe UTF-8 code point boundaries.
>
> In Python you can do that, e.g., using
>
> codecs.getincrementaldecoder('utf-8')()
>
> When a frame does not end on a code point boundary, one needs to remember
> at most 3 bytes to continue validation on the next frame.
If every frame is valid UTF-8 on its own, then you don't need to keep any state
between frames (on either end of the connection).
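
For concreteness, here is a rough sketch of the streaming approach described
above, using the incremental decoder from Python's codecs module. The function
names (validate_frames, pending_bytes) are made up for illustration, and the
frames are assumed to be the raw payload bytes of one text message. It also
shows the state that has to be carried from one frame to the next:

```python
import codecs


def validate_frames(frames):
    """Return True if the concatenated frames form valid UTF-8.

    Hypothetical helper: 'frames' is an iterable of bytes objects, one per
    frame of a single text message.
    """
    decoder = codecs.getincrementaldecoder('utf-8')()
    try:
        for frame in frames:
            # final=False lets a frame end in the middle of a code point;
            # the decoder buffers the incomplete tail internally.
            decoder.decode(frame, final=False)
        # End of message: a buffered incomplete sequence is now an error.
        decoder.decode(b'', final=True)
    except UnicodeDecodeError:
        return False
    return True


def pending_bytes(frame):
    """Return the bytes the decoder must remember after one frame."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    decoder.decode(frame, final=False)
    # getstate() returns (buffered_input, extra_state) for this decoder.
    return decoder.getstate()[0]
```

If frames always end on code point boundaries, pending_bytes() is always empty
and each frame can be checked in isolation.
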
> It would make sense that a peer SHOULD fail a connection upon invalid UTF-8
> as soon as it is possible - that means with at most 1 frame delay upon the
> start of the byte sequence that was invalid UTF-8.
>
> Anyway: what's the advantage of such a requirement?
The advantage is frame-wise validation instead of message-wise validation. As
you point out, it's not a huge distinction, more "be conservative in what you
send". It just seems unnecessarily sloppy not to have frame boundaries
coincide with code point boundaries.
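
To illustrate what I mean, here is a minimal sketch of the sender and receiver
sides under that rule. Both function names are hypothetical, the payload is
assumed to be valid UTF-8 already, and max_len is assumed to be at least 4 so
that any single code point fits in a frame:

```python
def frame_is_valid(frame):
    """Stateless per-frame check: the frame must be valid UTF-8 by itself."""
    try:
        frame.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


def split_on_code_points(payload, max_len):
    """Split UTF-8 bytes into frames of at most max_len bytes each,
    never cutting inside a code point.

    Assumes 'payload' is valid UTF-8 and max_len >= 4.
    """
    frames = []
    start = 0
    while start < len(payload):
        end = min(start + max_len, len(payload))
        # If the byte just past the cut is a continuation byte (10xxxxxx),
        # the cut falls inside a code point, so back up to its lead byte.
        while start < end < len(payload) and (payload[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            raise ValueError('max_len too small for this code point')
        frames.append(payload[start:end])
        start = end
    return frames
```

With this on the sending side, the receiver can run frame_is_valid() on each
frame as it arrives and fail the connection immediately, without carrying
anything over to the next frame.
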
--Richard
_______________________________________________
Gen-art mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/gen-art