On 03/28/2011 11:29 AM, Mark Martinec wrote:

A UTF-8-encoded character consists of 1 to 4 octets. The above
regexp makes sure that a truncation point does not occur in the
middle of a single-character encoding, which would produce
a syntactically invalid UTF-8 string (choking SQL, etc).

See sections 3 and 4 of RFC 3629. The 1 to 3 trailing octets
of a multi-octet UTF-8 sequence all have the topmost bits 10,
i.e. 10xxxxxx (UTF8-tail = %x80-BF), which is why these are
excluded from the set of valid character-starter octets in the
above [\x00-\x7F\xC0-\xFF].
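
As an illustration of the same idea outside a regexp, here is a minimal
Python sketch. It is not the code under discussion; the function names
is_utf8_tail and truncate_utf8 are hypothetical, and the input is assumed
to be valid UTF-8 already. It backs the truncation point up until the
first dropped octet is no longer a UTF8-tail octet (10xxxxxx).

```python
def is_utf8_tail(octet: int) -> bool:
    """Return True for a UTF8-tail octet, i.e. 10xxxxxx (0x80-0xBF)."""
    return (octet & 0xC0) == 0x80


def truncate_utf8(data: bytes, max_octets: int) -> bytes:
    """Cut data to at most max_octets without splitting a character.

    Hypothetical helper; assumes data is valid UTF-8 and max_octets >= 0.
    """
    if len(data) <= max_octets:
        return data
    cut = max_octets
    # data[cut] is the first octet that would be dropped. If it is a tail
    # octet, the cut falls inside a multi-octet character, so step back
    # until the cut sits in front of a character-starter octet.
    while cut > 0 and is_utf8_tail(data[cut]):
        cut -= 1
    return data[:cut]


if __name__ == "__main__":
    sample = "a\u20acb".encode("utf-8")   # 61 E2 82 AC 62
    print(truncate_utf8(sample, 3))       # cut would split the euro sign
    print(truncate_utf8(sample, 4))       # cut lands on a boundary
```

Because the loop only ever moves the cut backwards, the result may be up to
3 octets shorter than the requested limit, but it always decodes cleanly.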


Thanks Mark, I was not processing this portion of it properly as I read through the RFC. Thanks for putting more emphasis on it.
