On Wed, Dec 26, 2012 at 9:14 PM, Al Johnson <[email protected]> wrote:
> Thanks for the Java API ref. But, I'm curious as to how or where invalid
> UTF-8 sequences come about; is it primarily a hacker thing?

Most frequently due to buggy bot tools or reaaaally old browsers that didn't
support UTF-8 correctly.

> I see the Java Character API has an isValidCodePoint() method. Do I
> just run each code point through that?

By the time your data is in Java String objects or 'char's it's already been
decoded from UTF-8 (8-bit byte stream) into UTF-16 (16-bit character
string). I don't remember offhand enough about Java I/O to tell you exactly
what class in the input stack is doing that though. :)

-- brion
_______________________________________________
MediaWiki-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
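
To illustrate the point about decoding: a minimal sketch, assuming the input is still available as raw bytes. Validity has to be checked at the byte-to-char decoding step, because Character.isValidCodePoint() only tests whether an int lies in the range 0 to 0x10FFFF and cannot see malformed bytes that were already replaced during decoding. One way to do that check is a java.nio CharsetDecoder set to report errors. The class name Utf8Check and the helper isValidUtf8 are hypothetical names made up for this example, not part of any API mentioned in the thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Hypothetical helper: returns true if the bytes are well-formed UTF-8. */
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};

        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));

        // The lenient String constructor does not fail; it substitutes the
        // replacement character U+FFFD for the malformed sequence.
        String lenient = new String(bad, StandardCharsets.UTF_8);
        System.out.println(lenient.indexOf('\uFFFD') >= 0);
    }
}
```

The strict decoder is the design choice that matters here: once the bytes have gone through a lenient decode into a String, the original invalid sequence is gone and only the U+FFFD replacement remains to hint that something was wrong.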
