Re: [MediaWiki-l] Normalization Code

Brion Vibber Wed, 26 Dec 2012 23:24:44 -0800

On Wed, Dec 26, 2012 at 9:14 PM, Al Johnson <[email protected]> wrote:


> Thanks for the Java API ref.  But, I'm curious as to how or where invalid
> UTF-8 sequences come about; is it primarily a hacker thing?


Most frequently due to buggy bot tools or reaaaally old browsers that
didn't support UTF-8 correctly.


>   I see the Java Character API has an isValidCodePoint() method.  Do I
> just run each code point through that?
>

By the time your data is in Java String objects or chars, it has already
been decoded from UTF-8 (a stream of 8-bit bytes) into UTF-16 (a sequence of
16-bit code units), so any check for invalid UTF-8 has to happen at or before
that decoding step. I don't remember enough about Java I/O offhand to tell
you exactly which class in the input stack is doing that, though. :)
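
For illustration, here is a minimal sketch of checking the raw bytes before
they become a String. It assumes you can get at the input as a byte[]; the
class name Utf8Check and the method isValidUtf8 are made up for this example.
It uses the standard java.nio.charset.CharsetDecoder with
CodingErrorAction.REPORT, which makes the decoder throw on malformed input
instead of silently substituting U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /**
     * Hypothetical helper: returns true if the bytes decode as
     * well-formed UTF-8, false if the decoder reports an error.
     */
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw rather than replace bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

Note that new String(bytes, StandardCharsets.UTF_8) would not help here: it
replaces malformed sequences with U+FFFD rather than reporting them, so the
evidence is already gone once you hold a String.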

-- brion
_______________________________________________
MediaWiki-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
